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Abstract 



Compressed sensing posits that, within limits, one can undersample a sparse signal and yet 
reconstruct it accurately. Knowing the precise limits to such undersampling is important both 
for theory and practice. 

We present a formula precisely delineating the allowable degree of of undersampling of gener- 
r alized sparse objects. The formula applies to Approximate Message Passing (AMP) algorithms 

for compressed sensing, which are here generalized to employ denoising operators besides the 
traditional scalar shrinkers (soft thresholding, positive soft thresholding and capping). This 
O paper gives several examples including scalar shrinkers not derivable from convex optimization 

- the firm shrinkage nonlincarity and the minim,ax nonlincarity and also nonscalar denoiscrs 
^-H - block thresholding (both block soft and block James-Stein), monotone regression, and total 

^ variation minimization. 

Let the variables e = k/N and 5 = n/N denote the generalized sparsity and undersampling 
fractions for sampling the fc-generalized-sparse A^- vector xo according to y = Axq. Here A is an 
n X N measurement matrix whose entries are iid standard Gaussian. The formula states that 
the phase transition curve 5 = 6{e) separating successful from unsuccessful reconstruction of Xq 
^ by AMP is given by: 

^ S = M(£|Denoiser), 

7^ where M(e|Denoiser) denotes the per-coordinate minimax mean squared error (MSE) of the 

specified, optimally-tuned dcnoiser in the directly observed problem y = x + z. In short, the 
phase transition of a noiseless undersampling problem is identical to the minimax MSE in a 
denoising problem. We derive this formula from State Evolution and we present numerical 
results validating the formula in a wide range of settings. 

The above formula generates numerous insights, including: (a) in the scalar sparsity case, 
that £i minimization is very close to optimal among all possible sparsity seeking algorithms, 
including those deriving from nonconvcx optimization; (6) in the block sparsity case, block soft 
thresholding is dramatically outperformed by James-Stein block thresholding for large block- 
sizes; block James-Stein recovers block sparse signals for any 5 > s, provided the blocklength 
is above a finite value {S, e) . (c) in the nonscparablc cases of monotone regression and TV 
denoising, the sparsity-undersampling phase transition obtained by AMP tailored to the gen- 
eralized sparsity is definitely better than than the phase transition achievable using merely ii 
minimization of coeflicients in the Haar basis. 
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1 Introduction 



In the noiseless compressed sensing problem, we are given a collection of linear measurements of 
an unknown vector xq: 

y = Axo. (1.1) 

Here the measurement matrix ^ is n by A^, n < A^, and the A^-vector xq is the object we wish to 
recover. Both y and A are known, while xq is unknown; we seek to recover an approximation to 

Xq. 

Since n < N, the equations are underdetermined; it seems hopeless to recover xq in general, but 
in compressed sensing one also assumes that the object is sparse in the appropriate sense. Suppose 
that the object is known to be k-sparse, i.e. to have k nonzero entries. If the problem size {k, n, N) 
is large, many recovery algorithms exhibit the phenomenon of phase transition. 

Let e = k/N and 6 = n/N denote the sparsity and undersampling parameters, respectively. 
Hence (e,5) S [0,1]^ defines a phase space for the different kinds of limiting situations we may 
encounter as {k,n, N) grow large. For a variety of algorithms and Gaussian matrices A with iid 
entries, one finds that this phase space can be partitioned into two phases: "success" and "failure". 
Namely, for a given algorithm A and given sparsity fraction e, there exists a critical fraction 6{e, A) 
such that if the sampling rate 6 is larger than the critical value, 6 > 6{e\A), then the algorithm 
is successful in recovering the underlying object xq with high probabilit}Q while if 5 < the 
algorithm is unsuccessful, also with high probability. Here, in the interesting cases, (5(e|^) < 1, 
which means that it is indeed possible to undersample and still recover the underlying object; 
in fact shows precisely the limits of allowable undersampling. By now a large amount 

of empirical and theoretical knowledge has been compiled about the phase transitions exhibited 
by different algorithms; we refer the reader to [DTincl IDonOBl IDTOQal IXHIOI IBCITIIL IStoOQL 
IKWT09I IDMM09I IDMMIH IWai09j . In a parallel line of work, a number of sufficient conditions 
under which undersampling is possible using deterministic matrices have been studied, see e.g. 
[CT051 [BRT091 lBGI+081 ICanOGj . 

It is however fair to say that the research focused so far on 'unstructured' notions of spar- 



sity whereby k simply counts the number on non-zero entries in xq. (We refer to Section 1.5 for 
an overview of related literature.) On the other hand, applications naturally lead to 'structured' 
notions of sparsity. This paper applies an algorithm framework - Approximate Message Passing 
(AMP) - to derive specific algorithms applicable to a variety of compressed sensing settings, in- 
cluding block and structured sparsity, convex and nonconvex penalization, and develops a single 
unifying formula that, specialized to each instance, gives the actual phase transition that we observe 
in practice. To give a preview of our results, we first recall some facts about AMP and statistical 
decision theory. 

1.1 AMP Framework 

Suppose X is an unknown signal vector in that is of interest to us, and that we observe 
Y = X + Z, where Z € N(0, ct'^In) is white Gaussian noise (here and below 1^ denotes the m x m 
identity matrix). In other words, we have direct but noisy measurement of X (in contrast to Eq. 



(1.1) above, where we had indirect incomplete but noiseless observations). Let rj{-;T,a) : R i— ?■ 



^Throughout the paper, we will write that an event holds with high probability (w.h.p.) if its probability converges 
to 1 in the large system limit N,n ^ oo with 6 = n/N and e = k/N fixed. 
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be a denoising procedure, designed to recover objects X typical in our application. In other 
words, if the observation takes value Y = y, then we estimate the signal as r]{y; r, a). The denoiser 
r] depends parametrically on the noise level a, and on r that denotes one or more tuning parameters, 
to be optimized over. 

A simple example is soft thresholding. Define the scalar nonlinearity : R i— t- R where 

^A°^*(y) = sgn(y)(|y| — A)+. Then, for y G R^, we define the denoising operator by applying the 
same nonlinearity componentwise, with A = ra: ri{y;T,a) = {{'riT°J* {yi))f=i- This is appropriate to 
recover objects known to be sparse, and will be used as a running example in what follows. 

Approximate Message Passing is an iterative scheme that takes a simple denoising procedure 
and turns it into a scheme for solving the compressed sensing problem. Working now in the setting 



of Eq. ( 1.1 ), it starts from = 0, and proceeds for iterations t = 1, 2, ... by maintaining a current 
reconstruction x* and a current working residual z^, and adjusting these iteratively. At iteration t, 
it forms a vector of current pseudo-data = x^ + A*z^ and the next iteration's estimate is obtained 
by applying rj to the current pseudo-data: 

= x^ + A*z\ (1.2) 
= r]{y';r,at), (1.3) 
= y-(Ax*+i) + a/. (1.4) 

Here at is the current noise level, and is a scalar. The computation of at and ilt is described 
below. 

Conceptually, AMP constructs an artificial denoising problem at each iteration and solves it 
using a standard denoising procedure. Thus it solves a compressed sensing problem by successive 
denoising. 

For now, this description should be sufficient, save for two remarks: 

1. The specifics of the construction are absolutely crucial for the results of this paper. These 



are embedded in the specification of the scale factors Qt and at] see Section 6.1 



The above algorithm framework was originally derived in |DMM09t IDMMIO] in the case 
of a separable denoiser ry, i.e. a denoiser acting independently on each coordinate. In that 
paper the derivation was based upon constructing a proper belief propagation message passing 
algorithm, and then obtaining the above algorithm as a first-order approximation. Specific 
separable denoisers corresponded to different choices of the prior in belief propagation. 



A central point of this paper is that the form of the algorithm (1.2), (1.3), (1.4) is really more 



general and can be used in settings outside the original derivation. 
1.2 Minimax MSE for Denoising 

Again in the denoising setting, we have a random vector X G R^ of interest and observe direct 
noisy observations Y = X + Z. Without loss of generality, we can assume o"^ = 1, and denote 
the corresponding denoiser by ?]( • ; r) = r/( • ; r, 1). Assume that our denoiser is parameterized by r 
ranging over the set Q. Also assume that the underlying distribution z^^r of the vector X belongs 
the class T of probability distributions. It is convenient to evaluate the procedure ry through the 
following minimax problem 

M{J^\rj) = ^ min max E,^||X - r?(Y;T)||^. (1.5) 
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In words, we tune the denoiser optimally to control the (per-coordinate) mean-squared error for 
typical signals from even the most unfavorable choice within our class 

In our running example, the denoiser is coordinatewise soft thresholding, and the class J-n,s is 
the class of probability distributions supported on vectors X E with at most Ne nonzeros. Let 
-^(-^Af.el^) denote the corresponding minimax MSE. It turns out that M(J^jv,e|^) can be character- 
ized in terms of a scalar estimation problem. Consider the collection Ti^^ of univariate probability 
distributions which place a fraction 1 — e of their probability at zero and a fraction e at another 
distribution. Let X denote a scalar random variable with distribution u G J^i,^, and letY = X + Z, 
where Z ~ N(0, 1). Define minimax soft thresholding MSE is: 

Mi(e|Soft) = inf sup E^{{X - vT^\Y)?}- 

One can show that M{FN,e\'n) = Mi (e| Soft). The function e i— )• Mi (e| Soft) has been characterized 
in [DJ94] . 

In several other simple examples M{T\ini) can be nicely evaluated: 

• The positive-constrained case, where X > 0. In this case r] is non-negative soft thresholding, 
and T = J^N,e,+ is the class of probability distributions supported on vectors X S that 
are nonnegative and A^e-sparse. 

• The box-constrained case where X G [0,1]^, and we use capping r]{x) = min(l, max(x, 0)) 
coordinatewise. Here = J-N,e,ci is the class of probability distributions supported on vectors 
X G [0, 1]^ that are at the bounds and 1 except in a fraction s of coordinates. 

In these cases M{T\r]) can be reduced to the calculation of quantities Mi(e|SoftPos), Mi(e|Cap) 
defined in complete analogy to Mi (e| Soft). In this paper we will give several other calculations of 
M{J^\rj), going considerably beyond these examples. 

1.3 Phase Transition for AMP 

Our principal result is the following: 

Phase Transition Formula for AMP. First, consider the denoising problem for 
denoiser rj, and denote by M(J^e|?7) = M(e|r/) the corresponding minimax MSE for 
classes of signals in T^. 

Next, consider the noiseless compressed sensing setting, with signal xq sampled from 
the class J>, and reconstruction through AMP based on denoiser rj. In the large-system 
limit A^, n — 7- oo with n/N = 5, AMP will correctly recover the underlying object for 
pairs (e, 5) in the sparsity/undersampling phase diagram provided 

6 > M{e\r]). (1.6) 

Viceversa, for 6 < M{e\r]), recovery outside this region is not guaranteed, in the sense 
that, for objects sampled from the least-favorable distribution for M(e|r/), AMP will 
not correctly recover the signal. 

In some cases the per-coordinate mean-squared error is instead defined as a limit nmjv->ooiV"^E|lX-7](Y;r)|li 
across a sequence of problems of increasing size. 
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This formula encompasses the foUowing known results |DMM09] : 

• AMP soft thresholding (with optimally chosen tuning parameter) correctly recovers for 6 > 
Mi(e|Soft). 

• AMP positive soft thresholding (with optimally chosen tuning parameter) correctly recovers 
for 6 > Mi(e|SoftPos). 

• AMP capping correctly recovers for 6 > Mi(e|Cap). 



Appendix A spells out how these existing results fall under the aegis of Eq. (1.6) 



1.4 This Paper 



Our aim in this paper is to show that formula (1.6) is correct in settings extending far beyond 



the three examples just mentioned. In brief, we lay out several denoising problems, and in each 
one verify the general formula. This requires in each case (a) calculating the minimax MSE for a 
problem of statistical decision theory; (6) implementing AMP for compressed sensing with the given 
optimal denoising family; and (c) verifying empirically that phase transition does indeed occur at 
the precise sparsity/undersampling tradeoff indicated by the formula. 
In following sections, we consider the following different denoisers. 

Firm Shrinkage. Here the denoiser r] applies coordinatewise shrinkage using the so-called firm 
shrinkage nonlinearity rjl^^^ . This is a piecewise linear function of a scalar variable with 
two tuning parameters A = (Ai,A2). It thresholds the argument y to zero for \y\ < Ai, 
i.e. r]\^^{y) = in this case, and it behaves as the identity r?{*''™(y) = y for |y| > A2. 
Between the two regimes, it interpolates linearly. We show that minimax MSE function 
Ml (e I Firm) < Mi (e| Soft) strictly, and by verifying the general formula, we show that the 
phase transition curve for optimally-tuned AMP firm shrinkage is very slightly better than 
the phase transition for optimally tuned AMP soft shrinkage. 

Minimax Shrinkage. Here the denoiser applies coordinatewise shrinkage using a minimax 
shrinkage nonlinearity, hence implicitly we are optimizing the mean square error over r/ G 
{ all scalar nonlinearities }. We calculate the minimax MSE function Mi (e| Minimax), show 
that Ml (e| Minimax) < Mi (e| Firm) strictly, and verify that the phase transition curve for 
AMP minimax shrinkage is slightly better than the phase transition for both AMP soft or 
firm shrinkage. 

Block Thresholding. Here the denoiser 77 applies blockwise shrinkage using either block soft 
thresholding (for blocklength the i?-variate nonlinearity obeys riB,\{y) = y • (1 — ||y||2/A)+) 
or block James-Stein Shrinkage (we refer to Section [s] for an explicit definition) . We will 
compute the minimax MSE function MB(e|BlockSoft), and bound the minimax MSE function 
MB(e| JamesStein). We will verify that the phase transition curve for optimally-tuned AMP 
block shrinkage follows the general formula. 

Notice that, as demonstrated numerically in jDMM09j . and proved in |BMllbj in the case of Gaus- 
sian sensing matrices, soft-thresholding AMP reconstruction coincides with LASSO reconstruction 
(in the large system limit). By the above results firm-shrinkage AMP and minimax AMP both 
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outperform LASSO reconstruction. Correspondingly, it can be argued that blocksoft thresholding 
AMP coincides asymptotically with group LASSO, and hence James-Stein AMP outperforms the 
latter. 

In all of the above examples, the denoisers are coordinatewise or at least blockwise separable, 
so there is minimal "structure" to the sparsity. We next consider examples where the denoiser has 



more subtle structure. We find that formula (1.6) applies more generally, once we generalize the 
notion of sparsity. 

Monotone Regression. Here r] finds the least-squares projection of Y onto the cone of monotone 
increasing functions. As "sparse" sequences, we consider monotone signals with at most A^e 
points of increase. 

Total variation minimization. Here r] minimizes the residual sum of squares penalized by r 
times the total variation of the signal. As "sparse" sequences we consider signals with at 
most Ne points of change. 

In both cases, we find well-defined phase transitions, precisely at the location predicted by the 



general formula (1.6) 



These results have further implications, because AMP-based results are known to connect to 
important phase transitions studied previously by other techniques. 

• Phase transitions in high- dimensional geometry. Phase transitions in combinatorial geome- 
try, such as the weak neighborliness phase transition for random projections of the Platonic 
solids |DT10aj . have been shown to be identical to the phase transitions of AMP in certain 
"corresponding" problems [D MM09] . Hence, the general AMP phase transition formulas, 
which in the monotone and total variation cases refer to specific polytopes, generate fresh 
information, previously unknown in combinatorial geometry. 

• Phase transitions in convex optimization. We show here that phase transitions of several 
convex optimization procedures in compressed sensing agree to high accuracy with formula 



(1.6). Indeed this agreement was proved rigorously in the case of standard £i -regularized 
regression [BMllbj . As a result we have interesting new formulas for phase transitions in 
compressed sensing with structured sparsity. 

Phase transitions in non-convex optimization. We study here two AMP algorithms which 



are not equivalent to any convex optimization procedure, and show that (1.6) describes their 
phase transition curves, which are better than the phase transition curves of the more popular 
convex optimization procedures. 



Finally, one may ask: why should (1.6) be true? We show here that it can be derived from the 



validity of state evolution for AMP algorithms in general. In other words, when state evolution 



accurately describes the behavior of AMP algorithms, formula (1.6) can be expected to apply. In 
the case of separable denoisers, the correctness of state evolution follows from the general result 
of |BMlla| (the latter requires the denoiser to be locally Lipchitz continuous, a condition that is 
easily checked for all the examples considered here) . This result can be further generalized to block- 



separable denoisers [Ranll^ lBLM12j . The much broader validity of formula (1.6) we document in 



this paper suggests that state evolution is valid in a much broader setting than previously claimed. 
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1.5 Related Literature 



Approximate message passing algorithms for compressed sensing reconstruction were introduced 
in |DMM09] . They were largely motivated by the connection with message passing algorithms in 
iterative decoding systems [RU08] , and with mean field methods in statistical physics [MM09] (in 
particular the cavity method and TAP equations). We refer to [DMMli] for a discussion of these 
connections. 

The original AMP framework |DMM09t IDMMIO] included iterations of the form defined in 



Eqs. (1.2), (1.3), (1.4) whereby the denoiser is separable and identical across coordinates, namely 

Vt{v) = {rit{vi),r]t{v2),...,'nt{vN)) ■ (1-7) 

While this covers the Firm and Minimax shrinkage rules studied in this paper, it did not include the 
various non-separable denoisers we discuss below, namely the block, monotone and total variation 
denoisers. Further, in |DMM09] . the phase transition behavior was validated numerically only for 
Soft, SoftPos and Cap denoisers, that are in correspondence with well studied convex optimization 
problems. The extension to a noisy linear model y = Axq + w, with w G R" a vector of iid random 
entries was carried out in [DMMli] . We also refer to |Monl2j for an overview of this work. 

Several papers investigate generalizations of the original framework put forward in |DMM09] . 
The paper [BMlla] defines a general class of approximate message passing algorithms for which 
the state evolution was proved to be correct. This included in particular all separable Lipschitz- 
continuous denoisers. 

Rangan |Ranll| introduces a class of generalized approximate message passing (G-AMP) al- 
gorithms that cope with -roughly- two generalizations of the basic noisy linear model. First, the 
noisy measurement vector y can be a non-linear (random) function of the noiseless measurement 
Axq. Second, each of the 'coordinates' of xq can itself be a -low dimensional- vector. Interesting 
applications of this framework were developed in [KGRllt ISchll| . Let us notice that G-AMP does 
not cover any of the non-separable cases treated here (even the block sparse example), and hence 
provides a generalization in an 'orthogonal' direction. 

In a parallel line of work, Schniter applied AMP to a number of examples in which the signal 
xq has a structured prior [SchlOl ISSSlOl ISPSIO] . Inference with respect to the prior is carried out 
using belief propagation, and this is combined with AMP to compute a posteriori estimates. This 
type of application fits within the class of problems studied here, by chosing the denoiser rjt in 



Eq. (1.3) be given by the appropriate conditional expectation with respect to the signal prior. On 



the other hand, the general scheme provided by Eqs. (1.2), (1.3), (1.4) encompasses cases in which 
the denoiser is not derived from a specific prior. 

Maleki et al. [MAYBlT] used methods analogous to the ones developed here to study phase 
transitions for compressed sensing with complex vectors. This is closely related to the block- 
separable setting considered in Section [3] (there is however some difference in the structure of the 
sensing matrix). 

Structured sparsity models are studied from a different point of view in |BCDH10l ICHDB081 
ICICBIO] . Those works focus on deriving sparsity models that capture a variety applications, and 
of convex relaxations that promote the relevant sparsity patters. Reconstruction guarantees are 
proved under suitable 'isometry' assumptions on the sensing matrix. 

Closer to our approach is a recent series of papers [CRPWlTl IRRNIH IRRNllj , considering 
general classes of structured signals under random measurements. Let us emphasize two important 
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differences with respect to our work. First, these papers only deal with convex reconstruction 
methods, while we shall analyze several nonconvex approaches and demonstrate improvements. 
Second, they derive reconstruction guarantees using concentration-of-measure arguments, while we 
derive exact asymptotics (essentially based on weak convergence), which enables us to unveil the 
key relation (1.6) between denoising and the compressed sensing phase transition. 



2 Scalar- Separable Denoisers 

In this section we study scalar-separable denoisers, i.e. denoisers that take the form 

Vt{v) = {vt(.vi),rit{v2), . . .,'nt{vN)) , (2.1) 

where we assume that the denoiser is symmetric across coordinates and, with a slight abuse of 
notation, we used the same symbol for the overall vector denoiser and for the scalar denoiser 
?7t : R I— R. Often (but not always) r/t obeys ||r/f(u)||2 < \\v\\2, so one commonly encounters the 
term 'shrinker' rather than 'denoiser'. 



2.1 Basic Simplifications 

All of our scalar denoisers r]^ : R — )• R take the form r]t{Y) = ri(Y; r, at). Here r E denotes one 
or more tuning parameters that depend on the signal class but not on the iteration. Also at is the 
effective noise level in observation Y. Under the state evolution formalism described in Section 4, 
at iteration t, in coordinate i, we have Y = yt^i ~ xo,i + at Zi with Zi ~ N(0, 1). 

In fact, the scalar denoiser will be specified only for at = 1, and extended through the scaling 
relation 

r]{y;T,at) = atrji—^T,!) . (2.2) 
\at ) 

Notice that this choice is justified by the fact that the signal classes under consideration are invariant 
under rescaling. We will use the simpler notation r\{y\ r) = r/(t/; r, 1) for the denoiser at noise level 



one. This section develops the machinery needed to apply formula (1.6) to the case of scalar 
denoiser operators. 



2.2 Minimax MSE of a Shrinker 

Consider the scalar observation Y = X ^ Z where X is the unobserved quantity of interest, 
Z ~ N(0,1) independently of X, and Y is observed. Denote by v the distribution of X\ the 
mean-squared error (MSE) of the shrinker is defined so: 

MSE(r?(-;r);z.) =E{[r?(X+ Z;r) -X]'}. 

Again, the random variables Z ~ N(0, 1) and X ^ v are independent. 

Obviously, the performance depends on both the tuning parameter r and the distribution v of 
X. The random variables X of interest are only occasionally nonzero. Let J-\^e denote the collection 
1-dimensional marginal distributions placing mass (1 — e) at the origin: 

^i.e = {i^ : is probability measure on R with J^({0}) > 1 — e}. 
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We evaluate the performance of tuning constant r by the worst-case performance it yields over 
T^i^e, namely maxj-^ ^ MSE(7/(-; r), u). The minimax tuning parameter T*(e) solves the problem: 

M(e|r/) = M(J^ie|7?) = inf sup MSE(r?( • ; r), z^) . (2.3) 

res ueTi^^ 

The minimax shrinkage procedure (within the class of ?]( • ; r)) is denoted ??*(•) = r/( • ; r*). 

This general framework has been considered before in the context of soft thresholding ri{y; r) = 
sgn(y)(|y| — r) + , positive soft thresholding r]{y;T) = r]'^°f^P°'^{y^T) = {y — t) + , and also hard 
thresholding r]{y;T) = and the minimax MSE and minimax thresholds were derived and 

analyzed in [DPS92' 'DJ94j. Plots of the minimax soft threshold T*(e) and the minimax MSE 
are available in [DMM091 IDMMll] . Such plots also appear later in this paper as baselines for 
comparison of interesting new families, namely Firm Shrinkage and Minimax Shrinkage. 

2.3 Firm Shrinkage 

A frequently-voiced criticism of ii minimization and soft thresholding is the tendency to shrink 
large values by more than warranted. In the mid 1990's, firm shrinkage was introduced to correct 
this tendency by Gao and Bruce [GB97j : the name reminds us that it is intermediate between soft 
and hard thresholding; continuous like soft, but not shrinking large values - just like hard. The 
firm nonlinearity r/^*^™"- ; ri, T2) (denoted above by ??{*''™') has two threshold parameters r = (ti, T2), 
whereby < ri < r2. Below n the output is zero. Above T2 the output is the same as the input. 
Between these two thresholds, the output is determined by linear interpolation of the values at the 
thresholds. Formally: 

{0 \y\ < Ti, 

sgn(y)(|y| - ri)T2/(T2 - n) n < \y\ < T2, 
y \y\ > T2- 

The family ri-^^^"^{ ■ ;ti,T2) contains soft and hard thresholding as limiting cases: 

r,^-f\y;n)= lim r?^-™(y; n, r2), r?'^-^(y; n) = lim r?/-™(y; n, T2). (2.4) 

In this setting the general prescription for minimax MSE reduces to: 

M(e|Firm) = M(^i^|r/-^*^™) = inf sup MSE(r/-^*^™( • ; n, rs), i/) . (2.5) 

Implicit in this calculation is determination of the optimal thresholds Tj*(e) and T|(e). One can 
show that < M(e|Firm) < 1, that M is monotone increasing with e, that M(e) — )• 1 as e — )• 1 
and that M{e) — ;> as e — )- 0. 

Figure [T] shows the minimax MSE for firm shrinkage. The figure also shows similar results for 
soft and hard thresholding, for comparative purposes. Over the range presented, the minimax MSE 
for firm thresholding is strictly smaller than the MSE for hard or soft thresholding. Namely, over 
this range of e, 

M(e|Firm) < M(e|Soft) < M(e|Hard) . 

This in a sense validates certain traditional criticisms of soft thresholding, which is often said to 
shrink large values "too heavily"]^ 

^Note, however, that the use of hard thresholding instead of soft thresholding makes things worse! 
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£ 


M(e Hard) 


M(e Soft) 


M(e Firm) 


M(e Minimax) 


0.010 


0.0729 


0.0612 


0.0552 


0.0533 


0.025 


0.1547 


0.1231 


0.1137 


0.1093 


0.050 


0.2676 


0.2039 


0.1921 


0.1841 


0.100 


0.4497 


0.3288 


0.3165 


0.3025 


0.150 


0.5960 


0.4279 


0.4171 


0.3983 


0.200 


0.7161 


0.5111 


0.5024 


0.4802 


0.250 


0.8141 


0.5829 


0.5763 


0.5516 



Table 1: Minimax MSE of various scalar nonlinearities, across sparsity levels 



Figure 2.3 shows the minimax thresholds. At least for e < 1/3 we see clearly that Tj*(e) < 
r|(e) < oo, so firm thresholding is preferred over the limiting cases of hard and soft thresholding |^ 
Figure [2?3| shows the corresponding minimax denoisers for specific values of e. 



2.4 Minimax Shrinkage 

The previous example showed that a parametric family of shrinkers can improve on soft threshold- 



ing, and hence improve the predicted phase transition according to ( 1.6 ). The ultimate improvement 
one could make in this direction is to use the globally minimax nonlinear shrinker. This is the non- 
linearity rj that is minimax not within some parametric family, such as the soft thresholding or the 
broader firm thresholding family, but minimax over all measurable nonlinearities r; : R — t- R. For- 
mally, let i2(R) be the set of all measurable functions r : R — t- R, for such a r, set r/"''(x; r) = t(x). 
The minimax MSE over this class is 

M(e|r/''") = inf sup MSE(r/''"( • ; r), z^) 
re£(R) ueTi,, 

= inf sup MSE(?7, u) 

rjeC(R.) v&Ti,e 

= M(e|Minimax) . (2.6) 

The calculation of this quantity uses a variety of ideas from minimax decision theory, for example in 
|Bic81l ICSSTl IBC83[ IDDS92[ IJoh94bl IJoh94al lDJ94 j ; details are given in Appendix [B| A key point 
of this derivation is the characterization of the minimax nonlinearity as the MMSE Bayes rule (that 
is, the conditional expectation) for the so-called least-favorable prior. The least-favorable prior is 
the solution of Mallows' classical Fisher information problem |Mal78j . solved here numerically. 

Table [l| Figure [T] present numerical values associated with the solution of the minimax problem. 
Inevitably, M(e|Minimax) < M(e|Firm) < M(e|Soft), i.e. optimizing over all nonlinearities yields 
a smaller mean square error than soft or firm thresholding. On the other hand. Table [l] shows that 
the improvements are typically of size 0.01 or smaller over the range e G (0.01, 0.25). For very small 
£ it was pointed out in |DMM09j that |DJ94j showed 

M (el Minimax) 

lim — ^ — = 1. 

6^0 M(e|Soft) 

*These are numerical results. Open question: for e larger than the range studied here, does the minimax firm 
threshold have parameter r2(e) — oo reducing it to soft thresholding. 



12 




t 



Figure 1: Minimax MSE of various scalar nonlinearities, as a function of sparsity parameter e. 
Minimax MSE of firm thresholding is very close to that of soft thresholding for e > 1/3. 




Figure 2: Minimax thresholds of various thresholding nonlinearities, as a function of sparsity e. 
The sudden drop in the value of the hard threshold to zero near 0.45 coincides with the value of 
the minimax MSE in Figure 1 reaching 1.0. The numerical approximation to the minimax lower 
firm threshold approaches the minimax soft threshold as e increases beyond 0.3, and the upper firm 
threshold increases rapidly. 
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E = 0.01 



E = 0.05 




-5 5 -5 5 

input Y input Y 

Figure 3: Minimax nonlinearities of various types, at sparsity e G {0.01,0.05,0.15,0.25}. Black: 
hard thresholding; Blue: soft; Aqua: firm; Red: minimax shrinkage. 



Least Favorabie 




-Hard 
Firm 
- IVlinimax 



0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 



Figure 4: Least Favorable /i for various thresholding nonlinearities, as a function of sparsity e. 
Note: soft thresholding has /i*(e|Soft) = ±00, not shown. 
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In the limit of extreme sparsity, there is nothing to be gained by completely general nonlinearities 
over soft thresholding! The improvement is non- vanishing, but moderate for e non-vanishing. 



2.5 Empirical Phase Transition Behavior 



The research hypothesis driving this paper is that Eq. (1.6) describes the phase transition of AMP 
algorithms. In order to be completely explicit, we need to check the following predictions, for each 
nonlinearity ry of interest 

1. There exists a curve e i— )• 5(e|?/) such that for 5 > Si{£\r]) the corresponding AMP algorithm will 
typically succeed in reconstructing the unknown signal xq, and for 5 < 6{£\ri) the algorithm 
will typically fail. 

2. The curve is related to the corresponding scalar denoising problem by S{£\rj) = M(e|r/). 

We now test this hypothesis for the firm and globally minimax nonlinearities r] G {r)-^^^"^,ri°'^^}. 

Our experiment was conducted along the same lines as fM DlOl IDTOQbl IDMMOQt IDMMIH 
IBTIO] . We considered a range of problem sizes N G {1000,2000,4000} and a range of sparsity 
parameters e G {0.01,0.02,0.05, .10,0.15,0.20,0.25}. We considered a grid of S values surrounding 
the predicted phase transition d{e\r]), and ran A'gampie = 1000 Monte Carlo reconstructions at each 
parameter combination. We declared "success" when the relative mean-squared error was below 
1%: 

\\X — Xn 9 

" < 0.01. 



Ikolli 

We used t = 300 iterations of AMFB 

This experiment generated data set for each optimal nonlinearity, containing, for each algorithm 
and each fixed e, a list of values 6i and empirical success fractions pi. The success fractions observed 
at 6 > M(e|r/) were indeed typically better than 50% and at 6 < M{e\r]) were typically worse than 
50%. 

To quantify this tendency, we fit a logistic regression 

logit{pi) = a + m-S{e\v)), (2.7) 

where S{£\rj) = M(e|r/) was computed analytically using ideas mentioned earlier. For each set of 
data corresponding to given (A^, e) and each non-linearity, we estimate a and /3 from the logit fit, 
leading to values a, j3. Using these quantities, we estim ate the phase transition location as the value 



at which the probability p of success is 50%. Using Eq. (2.7) this corresponds to 3+/3((5— (5(e|7?)) = 0, 
i.e. 5 = S{e\'q) — {a/ (3). We are therefore led to define the offset between the empirical phase 
transition and the prediction S{£\ri) = M{£\r]) as 

FT = PT{N,e) = (2.8) 



In order to check the general relation provided by Eq. (1.6) we need to show that PT(A^, e) tends 
to zero as A^ gets large, to within the statistical uncertainty. In Table [2] we report our results on 
the empirical phase transition, confirming that indeed the offset is small and decreasing with A^. 
Notes: 

^In most cases the mentioned convergence criterion is reached after a much smaller number of iterations (roughly 

20) 
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• We also calculated formal 95% confidence intervals for PT, indicating the tight control we 
have of the correct value. 

• As in earlier studies |DT09b) . we don't expect the empirical PT to lie exactly at the prediction, 
but instead to obey a relation of the form 

const 

empirical PT^ — predicted PT^_j.oq ~Ji^ — '~ ^^^P^^^S error, 

for some 7 S (0, 1]. Our data supports this relationship, with 7 1/3. See Appendix [H} 

• Denoting by Pn the fitted slope coefficient at dimension N, evidence that /3iv is increasing with 
larger indicates that a sharpening of the phase transition is indeed occurring. Appendix 
[h| shows that (3^ ~ is consistent with our data. 



3 Block-Separable Denoisers 

We now turn to the case of block-structured sparsity, first introducing some notational conventions. 
We partition the vector x = (xi, 2:2, . . . , xjy) into M blocks each of size B. Denoting by blockmix) = 
(^(m-i)_B+i) • • • ) XmB) the m-th block, we write 

X = {blocki{x), . . . , block m{x)) , 

with, of course, N = MB. 

A block-separable denoiser is a mapping rjt : — )• that decomposes according to the 
above partition: 

r]t{x) = {r]t{blocki{x)), . . . ,r]tiblockMix))) , (3.1) 

where, with an abuse of notation, we use the same symbol to denote the single-block denoiser 
r/t : R^ ^ R^. 

The same simplifications described in Section 2.1 for scalar denoisers apply to the our treatment 
of block denoisers as well. Namely, for y € R'^, we have r]t{y) = r]{y;T,at) with at the effective 
noise level at t-th iteration. The scaling relation ri{y;T,(7t) = atr]{y /at;T, 1) allow us to focus on 
the case at = l, denoted by ry(y; r) = r/(y; r, 1). 

The AMP algorithm stays the same as defined by Eqs. (1.2) to (1.4) in this block-separable 
case except for calculation of the scalar Qf This is given by 

nt = Ave™ div{r]{blockm{y^),T^-^)). 

Here Ave^ denotes uniform average over the index m, i.e. Avem,/(W') = -^"^ Si^i /('^)' ™d 
div is the divergence operator. Namely, if Jf is the Jacobian of the mapping / : R^ ^ R^, 
div(/) = Tr(J/). In the case B = 1 these formulas reduce to the ones for scalar AMP that we have 
already seen. 
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Firm Thresholding with Minimax Tuning 



e 


Pred 


off. 1000 


off. 2000 


ci.lOOO.lo 


ci.lOOO.hi 


ci.2000.1o 


ci.2000.hi 


0.01 


0.0550 


0.0057 


0.0040 


0.0054 


0.0060 


0.0038 


0.0043 


0.02 


0.0960 


0.0065 


0.0039 


0.0062 


0.0069 


0.0036 


0.0041 


0.05 


0.1920 


0.0064 


0.0047 


0.0060 


0.0068 


0.0043 


0.0050 


0.10 


0.3170 


0.0067 


0.0050 


0.0063 


0.0071 


0.0047 


0.0054 


0.15 


0.4190 


0.0070 


0.0049 


0.0065 


0.0074 


0.0045 


0.0052 


0.20 


0.5070 


0.0068 


0.0052 


0.0064 


0.0073 


0.0048 


0.0055 


0.25 


0.5840 


0.0069 


0.0055 


0.0064 


0.0073 


0.0051 


0.0058 


AMP - Minimax Nonlinear Shrinker 


e 


Pred 


off. 1000 


off. 2000 


ci.lOOO.lo 


ci.lOOO.hi 


ci.2000.1o 


ci.2000.hi 


0.01 


0.0530 


0.0075 


0.0052 


0.0072 


0.0078 


0.0050 


0.0055 


0.05 


0.1840 


0.0126 


0.0099 


0.0122 


0.0130 


0.0096 


0.0102 


0.10 


0.3020 


0.0183 


0.0164 


0.0178 


0.0188 


0.0160 


0.0167 


0.15 


0.3980 


0.0147 


0.0127 


0.0142 


0.0151 


0.0124 


0.0131 


0.20 


0.4800 


0.0148 


0.0114 


0.0143 


0.0152 


0.0110 


0.0118 


0.25 


0.5510 


0.0155 


0.0124 


0.0151 


0.0160 


0.0120 


0.0127 


Soft Thresholding with Minimax Tuning 


e 


Pred 


off. 1000 


off. 2000 


ci.lOOO.lo 


ci.lOOO.hi 


ci.2000.1o 


ci.2000.hi 


0.01 


0.0610 


0.0061 


0.0045 


0.0058 


0.0064 


0.0043 


0.0048 


0.02 


0.1040 


0.0065 


0.0052 


0.0062 


0.0069 


0.0049 


0.0055 


0.05 


0.2030 


0.0089 


0.0072 


0.0085 


0.0093 


0.0068 


0.0075 


0.15 


0.4270 


0.0117 


0.0100 


0.0112 


0.0122 


0.0096 


0.0104 


0.20 


0.5110 


0.0118 


0.0100 


0.0113 


0.0124 


0.0096 


0.0104 


0.25 


0.5830 


0.0127 


0.0106 


0.0122 


0.0132 


0.0102 


0.0110 



Table 2: Empirical Phase Transition studies for AMP algorithms based on three shrinkers. 1000 
Monte Carlo repetitions at iV = 1000, 2000. Column Prec? corresponds to general formula (1.6), and 



is thought to accurate for large . The colu mns off. 10 00 a. nd off. 2000 report values of the offset 
PT estimated by logistic regression, using Eqs. (2.7) and (2.8). (Note that offsets are systematically 
smaller at A^ = 2000 than at A^ = 1000, consistent with hypothesis that they vanish as A^ — t- oo.) 
Columns ci.lOOO.lo , ci.lOOO.hi, ci.2000.lo , ci.2000.lo give lower and upper endpoints of formal 
95% confidence intervals for the offset. 
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3.1 Block Soft Thresholding 

We first revisit the denoising problem for a single block. 

Let Y £ be a random vector obeying Y = X + Z where Z ~ N(0, cj^/b)) independent of 
X, the object of interest. The block-soft thresholding operator r]^°f^(^ • ; r) is the nonlinear shrinker 
rj^"f^(Y;T) = Y ■ {1 — j^^)^, where r > is the threshold parameter. The case B = 1 reduces 
to traditional soft thresholding |DJ94| . Block thresholding has previously been considered by 
Tony Cai |Cai99j and also Hall, Kerkyacharian and Picard |HKP98j . although in specific 'wavelet' 
applications; our point of view is somewhat different. 

Shrinkage makes sense in the situation where X is often likely to be zero or at least small. Let 
u denote a probability distribution over and consider the dimension-normalized mean-squared 
error 

MSEb(7?-/*( ■■,t),u) = 1e{ \\X - r^{Y- t)\\1] , 

where X ^ u^Y ^ X + Z . We will show that when X is block-sparse in the sense of being different 
from zero with small probability, shrinkage works well. Formally, we let J-B,e denote the collection 
of distributions u obeying 

K{^ = 0})>l-e, 
and define the minimax block soft threshold risk in dimension B by: 

MB(e|BlockSoft) = inf sup MSEb(?7"°^*( • ; r), v). 

In Figure [5] we present graphs of M{£) = MB(e|BlockSoft) as a function of e. It is possible 
to prove the following structural properties: (i) < M{e) < 1 (the upper bound follows from 
taking r = 0); (ii) M(e) is monotone increasing and concave (monotonicity is a consequence of 
^B,e ^ J^B,£' for e < e', and concavity since any measure in J'^qei+{i~q)e2,B can be written as 
convex combination of measures in J^ei,B and in Fs^^b)'-, {Hi) M{e) — ?■ as e — >• 0; M(e) — 1 as 
e — )• 1. Recall that we are considering the dimension-normalized MSE. In terms of total-MSE, the 
unnormalized mean-squared-error, the last relation would be total-MSEB(e) — )• as e — )• 1. 

Associated with the minimax problem is also an optimal threshold value t*{£\B), that we plot 



in Figure 3.1 



As S — )• 00 the minimax per-coordinate MSE has a well defined, and particularly explicit limit. 
Lemma 3.1. As B ^ 00, we have 

Mi3(e|BlockSoft) ^ Moo(e|BlockSoft) = 2e-e'^. 



Further, the dimension-normalized minimax threshold converges with increasing blocksize: t* {e\B)/\' B 
T*{e\oo) = (1 -e). 



This lemma is proved in Appendix C.l 



3.2 Block James-Stein 

Starting from the perspective of ^i-penalized estimation, and compressed sensing, block soft thresh- 



olding seems very natural. However, the limiting relationship in Lemma 3.1 shows that this ap- 



proach leaves room for improvement at large blocksizes. The most ambitious goal we could set, 
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large-B limits of MSE for James-Stein and Blocksoft Shrinkage 



James-Stein 

Block Soft 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 

E 

Figure 7: Limiting i? — )• oo per-coordinate MSE of BlockSoft and James-Stein shrinkers. The 
limiting James-Stein MSE coincides with the ideal MSE. 



let's call it the ideal MSE, is the MSE achieved by a denoiser which utilizes a special oracle that 
tells us without error which blocks are zero and which are nonzero. The resulting ideal MSE 
is Mb (e| Oracle) = e. This is considerably smaller than MB(e| BlockSoft), even in the B ^ oo 



limit characterized by Lemma 3.1 On the other hand the denoising/compressed sensing problems 
become easier as i? — t- oo. Can we hope to achieve MB(e|Oracle)? 

To approach this ideal MSE in the large-blocksize limit, we propose to use the positive-part 
James-Stein shrinkage estimator |JS10] . On each block we propose the rule 

Analogously to block soft thresholding, this estimator shrinks to blocks with small norms. How- 
ever, the details are crucially different. The limiting B ^ oo behavior is, in fact, ideal, and 



noticeably better than blocks soft shrinkage, as shown by Figure 3.2 and formally in the next 
lemma. 

Lemma 3.2. Let MB(e| JamesStein) denote the minimax MSE for -q^^ over the class of s -block 
sparse sequences. For B > 2: 

2 

MB(e| JamesStein) < e + —. 

B 

Proof. Consider temporarily the case where Y = + Z with Z ~ N(0, 7^), and ^ G R'^ nonrandom 
and known. A simple calculation shows that the optimal linear diagonal estimator rj^^^ is given 
by 

r,^^^(y) = c(^).y 
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where c(^) = 11/^112/(11/^^112 + ^)- This estimator uses information about ||/i||2 (which could only 
be supplied by an oracle) to choose the constant c as a function of ||/i||2 which minimizes the 
mean-squared error among all rules of the estimators of form c ■ Y. Note in particular that the 
per-coordinate MSE of this estimator is 

and in particular 

= MSE(7?^^^, /x) < sup MSE(??^^^, /x) = 1 . (3.2) 

Define the ideal worst-case MSE by 

MB{eW^^)= sup E^{MSE(?7^^-^,/x = X)}. 



Applying (3.2), and keeping in mind that, under v, X = with probability at least (1 — e), and a 
bound 1 on the MSE of a nonzero block, we have 

MB(e|r?'^^) =e. (3.3) 

The oracle inequality |DJ95l Theorem 5] shows that for B > 2, and for every vector /i S R^, if 
Y ~ H{h,Ib), then 

MSE(r/-^^,/x) < MSE(7?^^^,/x) + ^. 
Combined with the previous display this proves the Lemma. □ 



The argument in the proof leads in fact to a convenient expression for MB(e| JamesStein 
Letting, with an abuse of notation, MSEb(?7'^'^, /i) = MSEb(??'^'^, 5^) denote the per-coordinal 
MSE when the signal distribution is a Dirac delta v = 5^, we have 

MB(e|JamesStein) = {I - e) MSEij(7?'^'^, 0) e maxMSEB(r?-^'^, /u). 



Now T]^^ is known to be minimax for the unconstrained problem of estimating a non-sparse vector 
/i, i.e. max^ MS EBir]'^^ , IJ^) = 1 yielding 

Mij(e| JamesStein) = (1 - e) MSEBiv^^^, 0) + e . 

Therefore computing the minimax MSE for rj"^^ reduces to computing the single quantity MSE b {iT^ ^ , 0), 
that can be estimated through numerical integration. A good approximation for large B is provided 
by the following formula MSEb{v^^, 0) ~ ^-^(l + 0.752/^/5). Hence we have 

MB(e|JamesStein) ^{l-e)^^ + ^^^} + £• (3.4) 

In the next section will use this formula in comparing the general prediction of Eq. (1.6) with the 
empirical results for the James-Stein AMP algorithm. 
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B=2 B=3 B=4 




Figure 8: Results for block soft thresholding AMP with minimax threshold. Reconstruction is 
based on noiseless measurements y = Axq. Here 6 = n/N is the undersampling fraction, and 
£ = k/N is the sparsity measure. Red: less than 50% fraction of correct recovery. Green: greater 
than 50% fraction of successful recovery. Blue Curve: 6 = MB(e|BlockSoft). 



3.3 Empirical Phase Transition Behavior 

We now turn to the compressed sensing reconstruction problem whereby the block-sparse vector 
xo is reconstructed from observed data y = Axq using the AMP algorithm. We want to test the 



hypothesis that Eq. ( 1.6 ) describes the phase transition of the two block shrinkage AMP algorithms, 



corresponding to the block soft thresholding, and block James-Stein. 

We conducted a set of experiments similar to those described in Section 2^ We constructed 
block-sparse signals at different undersampling and sparsity levels and ran tests of block threshold- 
ing AMP. 

Our results show that the curve 6 = M^(e|BlockSoft) correctly separates two phases of per- 
formance: below this curve success in AMP recovery is atypical and above it is typical. Similarly, 
the curve 6 = MB(e| JamesStein) correctly describes the phase transition for block James-Stein 
shrinkage. The empirical results are presented in Figure [s] (for block soft thresholding) and Figure 
|9] (for block James-Stein). 

4 Derivation of the Phase Transition by State Evolution 



In this section we derive the basic relation (1.6) between minimax mean square error in denoising 
and the phase-transition boundary in the sparsity-undersampling plane. Our derivation is based 
on the state evolution formalism developed in |DMM09l IDMM10| IBMllal iDMMllj . 

One basic observation that is crucial for state evolution is that the mean squared error of the 
AMP reconstruction x* at iteration t is practically non-random for large system sizes A^, and has 
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B = 8 B = 12 B = 16 




0.5 1 0.5 1 0.5 1 



B = 24 B = 32 B = 48 




e E e 



Figure 9: Results for bfock James-Stein AMP. Reconstruction is based on noiseless measurements 
y = Axq. Here 6 = n/N is the undersampling fraction, and e = k/N is the sparsity measure. Red: 
less than 50% fraction of correct recovery. Green : greater than 50% fraction of successful recovery. 
Blue Curve: asymptotic (large- i?) formula for S = M5(e| JamesStein) ( |3.4[ ). Black Curve: 6 = e 
(lower bound on minimax risk of James-Stein Shrinkage). 
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a well-defined limit as — )• oo. In particular, the limit 



mt = lim ^ ||xo - x^\\l, 
N^oo iV 

exists almost surely (with n — )• oo as well, while n/N — )• 6). Moreover, the evolution of mt with 
increasing t is dictated by a formula mt+i = ^{nit) which is explicitly computable, and defined 
below. We will use the term state evolution to refer both to the mapping m i— >• ^(m) and to the 
sequence {mt}t>o with appropriate initial condition. State evolution allows to determine whether 
AMP recovers the signal xq correctly, by simply checking whether — )• as t — )• oo (in which case 
the MSE vanishes asymptotically) or not. The latter problem does in turn reduce to a calculus 
problem. 

The papers jPMMOQl IDMMIOI IDMMllj . develop the state evolution framework for scalar 
shrinkers rj : i— )• R^, and verified its predictions numerically for three specific examples (namely 
for the shrinkers Soft, SoftPos, and Cap). However, the heuristic argument presented in those 
papers was much more general. Indeed, [BMllaj proved that state evolution holds, in a precise 
asymptotic sense, for Gaussian measurement matrices A with iid entries and generic scalar shrinkers 
T] : R^ I—)- R^, under mild regularity assumptions. 

Here we generalize this approach to vector shrinkers r] : R^ i— )• R^. This framework covers 
all the shrinkers discussed in Sections [2] and [3j and yields a formal derivation of the main formula 



(1.6), under the assumption that indeed state evolution is correct in this broader context. 

Before proceeding, let us remind the reader that we focused so far on denoisers applied to data 
Y = X + Z, whereby Z ~ N(0, /g) and hence the noise level is u = 1. For the more general case 
of Y = X + fjZ where cj 7^ 1, we recall the assumption of a scaling relation 

7]{Y;T,a) = a-r]{a-'Y;T,l). (4.1) 

In the next sections we will first introduce some basic notations and facts about state evolutions. 



Then we will derive the phase transition expression (1.6) by establishing first a lower bound and 



then a matching upper bound, both given by the minimax MSE. 
4.1 State Evolution 

The next definition provides the suitable generalization of the state evolution mapping to the 
present setting. 

Definition 4.1. For given 6,t > 0, i? € N and v a probability distribution over R^, define the 
state evolution mapping : R 1— )• R by 

Here, as before, X and Z are independent B-variate vectors, X i/, Z N{0,Ib), and rj is a 



given shrinker obeying {4-1 )■ In other words, '^{m;6,T,u,B) is the per- coordinate MSE at noise 
level cr^ = m/5. 

Given implicit fixed parameters {5,t,v,B), state evolution is the one- dimensional dynamical 
system: mt+i = ^{mt) starting from tuq = B~^Kiy\\X\\2. 
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The fixed points of the mapping m i— t- ^(m) play of course a crucial role in the analysis of state 
evolution. 

Definition 4.2. The highest fixed point of the mapping ^(•) = ^( • ; (5, r, i/, i?) is defined so: 
HFP(^') = sup{m : '^{m) > m}. 

The importance and applicability this notion is underscored by the next two observations. Here 
and below we say that a function / : R-|- — )• R is starshaped if x i— t- f{x)/x is decreasing. 

Lemma 4.1. Suppose any one of these three conditions holds: 

• ^(m) is an increasing function of m with mo > HFP(^'); or 

• ^(m) is an increasing starshaped function of m. 

• HFP(^') = 0. 

Then state evolution converges to the highest fixed point: 

lim rrit = HFP(^'). 

The proof is a standard calculus exercise; we omit it. 

Lemma 4.2. The function m i— )■ ^(m) is increasing starshaped for all of the following choices of 
the denoiser T], with the scaling convention (4-1)- 

• Soft Thresholding. 

• Positive Soft Thresholding. 

• Block Soft Thresholding. 

• James-Stein Shrinkage. 

Proof. We give a proof only for Soft, Blocksoft, and James-Stein shrinkers; those denoisers are co- 
variant under rotation. Explicitly, for any orthogonal matrix Q G R'^^^, r]{Qy; r, a) = Qri{y; r, a). 
As a consequence, if D is the probability measure obtained by averaging over rotations, we have 
^(m; 6, r, i>, B) = ^{m; 6, r, i/, B). 

We can therefore, without loss of generality, consider measures v that are themselves invariant 
under rotations. Any such measure can be written as convex combination of probability measures 
UJf^ that spread their mass uniformly on the sphere of radius fi > 0. For a given fixed r, let 
R{n) = R{h;t) denote the risk function = MSE(a;^;T, 1) From the scaling relation (4.1), we 
have the identity 

il'(m) = cr'^R{fi/a), cj^ = m/6. 

We need to show that ^{m)/m is nonincreasing in m > 0. However, this is just the same as asking 
if ■ \J 5/m) /5 is nonincreasing in m > 0, and is equivalent to asking if i— )• R{^) is monotone 
increasing in > 0. 

For each of the listed r/, it is known that the corresponding ^ i— t- R{ii) is strictly monotone 
increasing in > 0. Details are provided in Appendix |Ij □ 
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4.2 State Evolution Phase Transition 

Consider a collection J-^^b of probability distributions over R^, indexed by e G [0,1] We will 
consider the following two properties: 

1. Fs^B is nested, i.e. Fs^^b ^ -^£2,-6 whenever ei < 82- 

2. Te^B is scale-invariant, i.e. if z/ € then any scaled version of v is also in Ts^b- 

In particular, the class J-e.B of probability measures v obeying the sparsity constraint v{{X = 
0}) > 1 — e (introduced already above) is both nested and scale- invariant. 

We temporarily use the notation HFP(5, r, i/) = HFP(^'( • ]5,t,u, B)), and consider the mini- 
max value: 

HFP*(e,5)= inf sup HFP(5,r,zy). 

TeR+ u(^TB,e 

Definition 4.3. For e G [0, 1], define the state evolution phase transition as 

6sE{e) = mf{d>0: HFP*(e, 5) = O}. 

Note that HFP*(e,(5) is monotone decreasing as a function of 5, by definition of the state 
evolution mapping ^. It follows that HFP*(e,(5) = for 5 > Sssi^)- Further, by nestedness, it is 
monotone increasing as a function of e, which implies that e 1— )• 5sEi^) is monotone increasing. The 



rationale for this definition is that, for 6 > dsEi^) and under any of the assumptions of Lemma 4.1 
state evolution predicts that AMP will correctly recover the signal xq. 

Theorem 4.1. Let J-B,e ^ nested, scale-invariant collection of probability distributions, and 
assume that the shrinker r]( ■ ■,T,a) obeys the scaling relation (4.1). Define the minimax MSE 



M(e) = inf sup ^eJ||X - + Z;r, 1)||2|, 



where Z ~ N(0,/5) is independent of X ^ i^. Then 

5sE{e) = M{e) . 

In order to prove this result, we shall first establish a more general fact. Given a distribution v 
over R^, and for any r £ G, we let 

^se{t, y) = inf {(5 > such that HFP((5, r, z^) = O} . (4.2) 

The rationale for this definition is clear. Under the assumption that state evolution holds, for a 
signal Xq sampled from block distribution u, AMP (with tuning parameter r) is guaranteed to 
reconstruct xq if and only \i 5 > Sse{ti ^'). 
Analogously, we let 

M(r, u) = sup -^^u{\\X - v{X + aZ; r, a)\\l]; (4.3) 

the normalized MSE for denoising with worst case signal. With these definitions we have the 
following. 
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Lemma 4.3. For any probability measure v over R^, and any t ^ Q, we have 

6se{t,u) = M{t,u). (4.4) 

Proof. By definition 

6se{t, 1^) = inf |(5 > such that ^'(m; 6, r, i^^B) < m for all m > o| 

= inf > such that sup — '^{m;6,T,L',B)<l\ 
I m>o fn > 

= inf|(5>0 such that sup ^E{||X - ry(X + ^^Z; r, ym7^)f } < 1 j 
I- m>o "mB ) 

= inf 1(5 >0 such that sup -\--E{\\X - r](X + aZ;T,a)f} < l\ 
^ a>o oa-'B J 

= inf jj > such that (r, i/) < l| = M (r, z^) . 

□ 



We are now in position to prove Theorem 4.1 



Proof of Theorem \4-l\ Define 6.t{e) = inf^ge sup^gjr^ ^ (5(t, v). We claim that 5*(e) = (5(e). Indeed 
choose (5 G [0,(5(e)). Then by definition there exists m > such that, for all r G 0, there exists 
G -^e,_B with HFP((5, r, z^) > m. Hence, for all r G ©, there exists v G J>,_b with (5(t, z/) > 5. 
This implies that, for all r G 0, sup^^gjr ^ 5(t, i/) > 5, i.e. (5*(e) > 5. Proceeding in the same way, 
it is immediate to prove that, for any 5 G [0, (5*(e)), we have (5(e) > 5. Hence, we conclude that 
(5(e) = (5*(e) as claimed. 



To conclude the proof, we note that, by Lemma 4.3, we have 



(5*(e) = inf sup M{t,u) 



inf sup sup —^E,yi\\X - ri(X + aZ;T,a)\\l\ 
r(^e ^(zjr^ g a>o Ba^ I J 

inf sup ^EJ\\X-rj{X + Z;T,l)\\l}, 



where the last equality follows from the scale invariant property of J-B,e- The last quantity is 
nothing but M(e). □ 

4.3 Non-Covergence of State Evolution 



By Definition 4.3, for all 5 < Sssi^), ah probability distributions v G and all initial conditions 

mo, state evolution converges to the zero error fixed point, namely lini^__^QQ mt = 0. Viceversa, for 
(5 > (55_B(e), there exists ly G J^B,e, and an initial condition uiq, such that limf_i.oo "ij = moo > 0. 
The reader might wonder whether this conclusion (nonconvergence to 0) also holds if we use the 
initial condition that is relevant for AMP, i.e. if, for any 6 > SsEi^), there exists v G J-^B,e such 
that, taking mg = B^^K^, we have limf__s.oo = ""^oo > 0. 
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Lemmas 4.1 and |4. 2| immediately imply that the answer is positive for soft thresholding, positive 
soft thresholding, block soft thresholding, and James-Stein shrinkage It turns out that the answer 
is positive also for firm thresholding and the global minimax denoiser. In Appendix [D] we describe 
the argument for these cases. 



5 Nonseparable Denoisers 



To some readers, the application of formula (1.6) to a wide range of separable or block separable 



shrinkers may seem like a "minor" extension. We now show that the formula can be applied also 
to nonseparable shrinkers. 

5.1 Monotone Regression 

Let M denote the cone of nondecreasing sequences: 

X = {XgR^ : xt+i>xt forantG{l,2,...,iV-l}}. (5.1) 

Let rj'^"™ denote the shrinker of monotone regression^ namely 

r?~(y) = argmin^g^ \\y-x\\l. (5.2) 

This shrinker is highly non-separable, as one can understand most clearly by studying the standard 
pool-adjacent- violators algorithm for solving it. 



In the compressed sensing setting, one can implement exactly the AMP iteration (1.2)-(1.4) for 
this shrinker; let's call the resulting procedure monoreg AMP. 

To discuss the phase transition properties, we must generalize the notion of sparsity to simplic- 
ity: in this case, we say that a monotone sequence is /c-simple if it has at most k points of increase. 
We can then consider a phase diagram with coordinates (e,5), where e = k/N is the simplicity 
fraction and 5 = n/N is the undersampling fraction. Once this convention is adopted, numerical 
experiments document a well-defined phase transition, however, it does not seem to follow any 
previously-documented theoretical curve. For example, it is definitely not the same as any of the 
transitions mentioned in the paper so far. 



To apply formula (1.6), we need to calculate M(e|MonoReg), which first of all requires clearly 



defining the notion! Our approach is explained in Appendix |Ej 
5.2 Total Variation Minimization 

Let rf"" denote the shrinker of total variation penalized least-squares |ROF92] . also called fused 



LASSO TSR+05 



rf''{Y) = argminxeRiv||Y-X||^ + A-ry(X), (5.3) 

N-l N-1 



1=1 i=l 



Here we introduce the notation Axi = Xj+i — Xi for the discrete derivative. There is an extensive 
literature devoted to solving this denoising problem; see for example |V096| . 
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Empirical PT of AI\^P Monotone Regression vs Prediction (1 .6) 
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Figure 10: Results for monoreg AMP. Here b = n/N undersampling fraction e = k/N generalized 
sparsity measure. Red - less than 50% fraction of correct recovery. Aqua - greater than 50% 
fraction of correct recovery. Curve - Minimax MSE curve 6 = M(e|MonoReg). 



In the compressed sensing setting, one can implement exactly the AMP iteration (1.2)- (1.4) for 
the TV denoiser. We will call the resulting procedure TV AMP. 

To discuss the phase transition properties, we again generalize the notion of sparsity to simplic- 
ity: in this case, we say that a sequence is fc-simple if it has at most k change points., and consider 
a phase diagram with coordinates {e,d), where e = k/N is the simplicity fraction and 6 = n/N 
is the undersampling fraction. To implement our formalism, we need to use TV-AMP with the 
tuning parameter A that is optimal for the specified e, which as in other cases requires a subsidiary 
computation. 

Numerical experiments display a well-defined and seemingly novel phase transition. 
To apply formula (1.6), we need to calculate M{e\TV); see Appendix |Fj 



6 Discussion 

6.1 Clarifications on AMP 

The article |DMM09] defined the AMP algorithm for general scalar (separable) nonlinearities, and 
studied it for three specific cases: soft thresholding, positive soft thresholding and capping. The 
more general case was studied in |DMM10l IBMlla| , 

However, the definition of the algorithm given in the introduction left out two details: 

• To set the threshold level we need an estimate of the noise level. For sparse sequences, the 
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TV Compressed Sensing: N=200 TV Compressed Sensing: N=500 




Figure 11: Results for TV AMP. 5 = n/N undersampling fraction e = k/N generalized sparsity 
measure. Red - less than 25% correct recovery. Blue - greater than 25% fraction of correct recovery. 
Curve ~ Minimax MSE curve 6 = M{£\TV). 



following estimate was proposed in |DMM09] : 

^* = ^3T^median{|z*|}, 

where ^{z) is the normal distribution function. This is known as the 25% pseudo- variance in 
robust estimation; it is insensitive to a small fraction of large outliers. 

• The formula for the working residual involves the scalar fi*; in the scalar separable case 
?? : R H> R it was defined in [DMM09j as 

n ^-^ 

i 

Note the shift in indices: at each iteration we compute the next iteration's scalar. In the 
more general nonseparable case the formula is 

I 

= - • div r]{y; r*"^) 

where, for / : R^ i— )• R^, div/ is the divergence div/ = "^^^i g^. Note emphatically that 
although we are dealing with A^-vectors, we divide by n. 



6.2 Phase Transitions for other Algorithms 



Formula (1.6) connects an algorithmic property - phase transitions of recovery algorithms 
a property from statistical decision theory - minimax mean squared error in denoising. 



with 
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We have not explored here a further connection, joining the behavior of convex optimization 
with that of certain AMP algorithms. As proved in [BMllbj . in the large system limit AMP with 
soft thresholding denoiser effectively computes the solution to 

minimize - ||y — AxU^ + A||x||i, x G . 

for an appropriately calibrated A = A(r). Similarly, AMP with positive soft thresholding denoiser 
is expected to solve 

(P^x) minimize - ||y — + A(l, x), x G R^, 

where ( • , • ) denotes the standard scalar product over R^, and 1 is the all-one vector. Finally, 
AMP with capping denoiser effectively solves 

(Pn) minimize — ^xH^, x G [0, l]'^. 

It follows that the AMP phase transitions are the same as the convex optimization phase transitions. 


More generally, consider a generalized reconstruction metod of the form 

(Pj) minimize -||y — ylx||2 + AJ(x), x G R^. (6.1) 

All of the above examples can be expressed in this form for a suitable convex regularizer J : R^ — )• 
R. To this we can associate an AMP algorithm, by using the denoiser r/"^( • ; r) in Eq. (1.2), (1.3), 
(1.4), whereby 

V'^{y;T)^aTg min \ l.\\y - x\\l + t J {x)\ . (6.2) 

(we also let r][ - ]T,a) = r]{ - ;t ■ a)). We will refer to this algorithm as to AMP- J. 

We then have the following general correspondence, which follows immediately by writing the 
stationarity condition of problem Pj = Pj{\) and the fixed points of AMP- J. 

Proposition 6.1. Any fixed point x°° of AMP-J with fixed point parameters Tqo, ^oo, Coo corre- 
sponds to a stationary point of Pj{\) with A given by 

A = rooO-oo(l - ^^oo) , f^oo = -divT/(y;T°°)L=j^oo (6.3) 

n 

In particular, if the regularizer J is convex, then fixed points correspond to minimizers. 



^ Previously, phase transitions for such convex programs were proven |Don061 IDTOQal IDTlOb ' and computed by 
techniques of combinatorial geometry; these were identified with phase transitions in random combinatorial structures, 
namely the neighborliness of face lattices of random polytopes. Thus, formula (1.61 connects fundamental problems 
of minimax decision theory to fundamental problems in high dimensional combinatorial geometry. 
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Notice that, for noiseless compressed sensing reconstruction, it is natural to consider the limit 
A —7- of the above problem, corresponding to 



minimize J{x) x G , 
subject to y = Ax . 

We wil next exemplify this correspondence for the examples examined in this paper. In the 
block-structured case of noiseless compressed sensing, we can compare blockSoft AMP to the fol- 
lowing convex optimization problem: 

M 

(P2,i) minimize ||6/ocfcm(3^)||2 ; 

m=l 

subject to y = Ax . 

The norm ||X||2,i = X]rri=i ll^^ocA;m(2;) II2 is called the mixed £2,1 norm. In the block thresholding 
tests of Section pi we also considered the empirical phase transition behavior of (-P2,i)- Figure 
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verifies that the phase transition occurs around the location predicted by (1.6), just as with block 
soft thresholding AMP. 

In the non-separable case of noiseless compressed sensing, we can compare monoReg AMP to 
the following convex optimization problem: 

(^mono) minimize (l,Ax) 

subject to y = Ax, Ax > 0, 



where Ax = (Axi, . . . , Axjv), Axj = (xj+i — Xj). Figure 13 verifies that the phase transition occurs 



around the location predicted by (1.6), just as with monoReg AMP, 



Finally, we can compare TV AMP to the following convex optimization problem: 

{Ptv) minimize TV{x) , 
subject to y = Ax, 



location predicted by (1.6), just as with TV AMP, 



where again TV{x) = X^i^i^ |Axi|. Figure 14 verifies that the phase transition occurs at the 



6.3 Interpretation as Penalized Optimization for Separable Denoiser 

Scalar soft thresholding 7]^°f^ can be interpreted as the solution of a penalized optimization problem 

(Pj) minimize -(y — x)^ + J(x) , 

where J(x) = r|x| (thus, we defined J to absorb the factor r). 

It turns out that a similar interpretation can be given for any scalar denoiser (and hence for 
any separable denoiser as well). In particular, th minimax and firm thresholding rules ry"'' and 
^firm^ • ; Ti, T2) are optimizers of nonconvex penalization schemes. 
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Figure 12: Results for (i-*2,i)- Here 5 = n/N is the undersampling fraction and e = k/N is the 
sparsity measure. Red - less than 50% probability of correct recovery. Green ~ greater than 50% 
probability of correct recovery. Used Sedumi solver. Curve separating success from failure is the 
Minimax MSE curve 5 = M(e|BlockSoft). 



Empirical PT of Convex Optimization IVlin I'dx subject y=Ax, dx>= 
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0.2 0.25 



0.3 0.35 



Figure 13: Results for {Pmono)- Here 5 = n/N is the undersampling fraction and e = k/N is the 
generalized sparsity measure. Red - less than 50% fraction of correct recovery. Aqua - greater 
than 50% fraction of correct recovery. Curve - Minimax MSE curve 5 = M(e|MonoReg). Compare 
Figure 10 
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Empirical PT of TV Minimization 
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Figure 14: Results for (Ptv)- Here 5 = n/N is the undersampling fraction and e = k/N is the 
generahzed sparsity measure. Red - less than 25% fraction of correct recovery. Blue - greater than 



25% fraction of correct recovery. Curve - Minimax MSE curve 6 = M(e\TV). Compare Figure 11 



We can derive the implied penalty J( • ) from the nonlinearity f]{-) hy observing that x + J'(x) = 
y at the solution x = r}{y). Defining the residual A{y) = y — r]{y)^ and noting that rf{y) + A{y) = y^ 
we observe that A(t/) and J'(x) are related through the change of variables 

j'iviy)) = A(2/). 

In other words, 

J{x) = / A{7]-^{u))du. 



Figure 15 shows the impUed penalties for minimax firm and global minimax shrinkage, respec- 
tively. In both cases, the optimal penalizations are nonconvex, in accord with the commonly- held 
belief that nonconvex penalization is "superior" to ii penahzation [CYOSllKXAHlOj . However, the 
optimal penalization is seemingly very close to the ii penalization, so that the prejudice in favor 
of nonconvex penalization must be re-examinedj^ 



An implied penalty can also be derived for block James-Stein shrinkage r]"^'^ . Due to the covarl 
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ance under rotation, the corresponding penalty only depends on the modulus of a; G R'^. Figure 
shows the implied penalty J{s\B) as a function of s = ||x||2- Namely, the penalty J{-\B) : R_|_ — t- R 
is such that, for y G R^, 



V'^^iv) = argmin min — x\\\ + J(||x||2|-B)| 



^The minimax nonlinearity is somewhat complicated to implement - see Appendix [b] Figure 15 suggests to us 
that among piecewise linear rules, close to the best phase transition behavior is obtained by a rule which obeys 
"niy) < J/ ~ const for large y, a possibility that is not explored in the firm family, or in any other conventional 
proposals. 
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Implied Penalties: Minimax Shrinkage Implied Penalties: Semisoft Shrinkage 

18 I , , 1 9 I 1 1 , , 

I e=0.01 




Figure 15: Implied Penalties J{x) for two shrinkers: (a) Globally Minimax Shrinkage; (6) Minimax 
Firm Shrinkage; at e S {0.01,0.025,0.05,0.10,0.25} only results for positive arguments x > are 
shown; results for negative arguments are obtained by symmetry J{—x) = J{x). 
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Implied Penalties in James-Stein Shrinkage 

3 1 1 1 1 




5 10 15 20 25 

s/(B-2)°'^ 

Figure 16: Implied penalties J{s\B) for James-Stein positive-part estimator for block sizes B € 
{5, 10, 20, 50}. Note that both axes are scaled with B dependent, so that all plots fit on a common 
scale. More precisely, the y-axis presents J{s\B)/B and the x-axis presents s/y/B — 2. 



coincides with the positive-part James-Stein estimator. The implicit penalization is again noncon- 
vex. 

6.4 Comparison to Traditional (5, p) Phase Diagrams 

In prior literature on phase transitions in compressed sensing, |DT10c| IDon06l IDT09al IBCTIH 
IDMM091 IDMMll"] . the authors considered a different phase diagram, based on variables 5 and 
p = e/5. The relation e = p-5 makes for a 1-1 relationship between the diagrams, so all information 
in the two diagrams can be presented in either format, however: 

• The (5-axis is the y-axis here, whereas in the other cited work, it was the x-axis. 

• The phase for successful recovery was below the {6,p{6)) curve in prior work, whereas here it 
is above the (e,5(e)) curve. 

As a result, the ideal transition (5(e|Ideal) = e in the current framework corresponds to p{5) = 1 in 
the prior framework. In the prior framework p = 1 corresponds to as many nonzeros as equations 
in the the underlying system of equations, and so corresponds to a natural limit on any procedure. 
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Translating back and forth between these two frameworks can be instructive. Our result on 
James-Stein shrinkage achieving the ideal MSE -M(e) = e in the limit of large B can be restated as 
saying that block James-Stein can recover block sparse vectors with essentially as many nonzeros 
as measurements, in the limit of large blocksize. 



6.5 Contributions of This Paper 

We list eight contributions, beginning with the two most obvious: 

• Application of the AMP framework to a wider range of shrinkers rj{-). Here we have imple- 
mented and studied AMP algorithms that don't correspond to convex optimization (eg the 
Firm and Minimax scalar shrinkers) and also those that don't correspond to scalar separable 
nonlinearities: both the block separable case (block soft thresholding and block James-Stein 
shrinkage) and the general inseparable cases. 

• A formula for phase transitions of AMP algorithms. As promised by the introduction, we 



see that formula (1.6) accurately describes the sparsity- indeterminacy combinations where 



AMP algorithms successfully recover a sparse signal from underdetermined measurements. 

A formula predicting the phase transitions of many convex optimization problems. Much 
work on compressed sensing establishes the possibility of recovery under sparsity by solving 
convex optimization problems. In most of the cases the trade-off between sparsity and un- 
dersampling is only known up to multiplicative constants. Quantitative guidance is provided 
by knowledge of the existence and location of phase transitions in the sparsity /indeterminacy 
plane. Unfortunately,considerable work was required to obtain those transitions for one con- 
vex optimization algorithm: ii minimization |Don061 IDT051 [DT09al IDTlOc] . The arguments 
needed to attack for example the block-sparsity case seemed to be quite different |Sto09j . In 
contrast, we have demonstrated here an approach which yields useful predictions in numerous 
cases, and possibly in many others. 

Reconstruction approaches not based on convex optimization, with sharp guarantees. We in- 
troduced three new AMP algorithms, based on Firm, Minimax, and James-Stein shrinkage, 
which are not derived from convex optimization. These methods have better phase transi- 
tions than the corresponding convex optimization problems, in their domains (e.g. Firm and 
Minimax outperform ii minimization, while James-Stein outperforms block soft shrinkage for 
large B). To the extent that these methods can be viewed as solutions of optimization prob- 
lems we show that the implied penalties are nonconvex. Nevertheless, these methods converge 
to the correct solution as long as we stay on the good side of the phase transition boundary; 
and in the interior of the success phase, these methods typically converge exponentially fast. 

Limited benefit of nonconvex optimization for ordinary sparsity. Within the class of scalar 
separable AMP algorithms, the best achievable phase transition is obtained by the minimax 
shrinker. Unfortunately the improvement in the transition is relatively small. This contra- 
dicts the hopes of many who have worked in this area, but not the actual results they have 
presented. 

Existence of algorithms for the block sparse case approaching "Ideal" behavior. While most 
attention in the group sparse case concerns block soft thresholding (and the corresponding 
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^2 — ^1 penalized regression, a.k.a. group LASSO |YL10) ). we show here that the phase 
transitions for block thresholding do not tend as S — >■ oo to the ideal transition, corresponding 
to as many nonzeros as equations. We also show that positive-part James-Stein shrinkage 
does tend to such an ideal limit. 

Identification of Combinatorial Geometry Phase Transitions with Minimax Mean-Squared Er- 
rors. The phase transitions for £i optimization, and for positivity-constrained ii optimization 



are determined by combinatorial geometry; see |DT09a| . By our general formula (1.6), these 
transitions are numerically the same as the minimax MSE in problems of scalar shrinkage, 
the empirical evidence given in the supplement of |DMM09] . reinterpreted in the light of 



(1.6) shows this. In the appendix, we also prove analytically that a similar relation holds for 
another phase transition of combinatorial geometry: the phase transition for uniqueness of 
nonnegative sparse solutions to underdetermined linear equations happens at the same place 
as the minimax MSE for the positive-part estimator. 

• Calculation of the minimax MSE of monotone regression and total variation denoising. We 
are not aware of any previous work deriving the minimax MSE of these denoising procedures 
under the condition of e-sparse first differences. We give here a derivation and show that it 
agrees with the phase transition of both AMP and convex optimization algorithms. 

Two conjectures flow naturally from this work: 

• State Evolution accurately describes the behavior of a wide range of AMP algorithms, for large 



system sizes N. As we have seen. State Evolution allows one to derive (1.6). In the case of 



^1 



penalization the correctness of state evolution as a description of AMP is proved by |BMllb| . 



Since formula (1.6) is apparently successful in some generality, this suggests the conjecture 



that state evolution applies much more generally than to the cases proven so far. 

The AMP framework applies to a much wider range of shrinkers rj{ ■ ) than proposed here. 
If true, this would provide a general tool in compressed sensing. If one knew that a certain 
shrinker would be appropriate for the given type of content in the presence of direct noisy 
observations, then for the compressed sensing setting, one would simply AMP-ize it. 



A Classical Cases 



The general formula 1.6 was already derived in the Online Supplement to |DMM09] . at in the 
cases of soft thresholding and positive soft thresholding. We review the arguments here since they 
provide a useful stepping stone for understanding more complicate cases (cf., e.g.. Section [4]). 

Denoting by r/( • ; r) either soft of thresholding or positive soft thresholding with threshold r, 
and taking the worst case over (respectively) v G Ti^e or v € -^i,£,+ '^^ have 

M{e\ri) = inf sup E{{r]{X + Z;t) - Xf} 
= inf sup E{{7]{X + Z-T)-Xf} 

= inf sup I (1 - e) E [(r/(0 + Z;t) - 0)^] + eE [{r,{^l + Z;t) - fx)^] } . 
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Here X ^ v and Z ~ N(0,1) are independent random variables. The reduction to two-point 
distributions follows from the representation of J^i_^ as a mixture of such two point distributions, 
along with linearity of expectation. Finally, the supremum is only over > because the support 
of u is in R+ for u G ^i,e,+ or because the risk is even in /x for the case of soft thresholding. Define 

M{fi,T;e) = {l-e)E[{rj{0 + Z;T)-Of] + sE[{r]{n + Z;t) - fxf]. 

Both for 77 = rj-'^^f^ the soft thresholding denoiser, for r) = ri^°f^P°'^ the positive soft thresholding 
denoiser, we have that fj, 1-^ M{fj,,T;£) is monotone increasing in > 0. Hence, in both cases, 

supM(/x, r; e) = lim M(/Lt, r; e) = M(oo, r; e), . 



It is easy to see that 
We also have 



M(oo,t; 1) = 1 + 



„ ( fr — t)'^(I)(z) dz soft thresholding, 

M(0,r;0) =E{r?(Z;T)2&i5} = <^ {.[--'^^' ' ,2;/;y 

^ ^ L /V ) / i-j I j^^^^ - rj^'^^z) dz positive soft thresholdmg. 



Here and below (j){z) = /^/\/27r is the standard normal density and = J^^(p{x) dx is 

the normal distribution function. Since we seek M{£\r)) = inf^gR, M(oo, r; e), we look for a global 
minimum of r 1— >■ M(oo, r; e). Now 

1^ M(oo, r; e) = (1 - e) • c • [-2<^(r) + 2t(1 - #(r))] + e • 2t, 

where c = 1 for positive soft thresholding and c = 2 for soft thresholding. We conclude that the 
derivative vanishes where 

i{r)^^-*(-r)-' ' 



T c 1 — e 

The function L is monotone decreasing on (0, 00], and so there is a unique solution T*{e) of this 
equation. The minimax risk is then given by 

M{e\r]) = M(oo,r* (£);£). 

As a last 'classical' example, we consider the capping nonlinearity r)'^°'P{x) = min(l, max(0, x)). 
Unlike all the other shrinkers considered in this paper, 77'^"^ is not scale- invariant. To correctly 
interpret this setting, we introduce the positive part nonlinearity r/'*"(x) = xl^^y^y, and consider 
v G J^i,e,+- Define 

M(£|r/+)= sup E{{r]^{X + Z) - Xf}, 

where X ~ and Z N(0, 1) arc independent random variables. Defining as above M{p;e) = 
(1 - e)E{ (r?+ (0 + Z) - 0)2} + e • E{ (r/+ {1^ + Z) - fif} , we again have sup^ M{fi;e) = M {00; e) (notice 
indeed that rj'^iy) = r]^"f^P"'^{y^T = 0)). We then have the explicit formula 

poo poo 

M(oo;£) = (1-e) / (t){z) dz + e j z'^(f){z)dz 
Jo J-00 
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whence 

M(e|r/+) = M(oo;e) = -(1 -e) + e, 

and in particular, 

limM(e|r/+) = J. 

Notice that, for all the other shrinkers considered in this paper, we instead have lim£_j.o M{e\ri) = 0. 
So sparsity is of limited value for this shrinker. 

The paper |DT10b| considered the problem of recovering a vector x > from measurements 
y = Ax, where ^ is an n x matrix with iid N(0, 1) entries, by solving the simple feasibility 
problem. The authors derived a curve separating a success phase, in which the feasibility problem 
has, with high probability, a unique solution, from a failure phase, in which the problem admits 
multiple feasible solutions. Their curve, derived in {S,p) coordinates (See Section 6.4) can be 
translated into (e, 5) terms to yield 

(5(e|PosOrthant) = ^ + < e < 1 . 



We have therefore the following rigorously- proven case in which the general formula (1.6) in- 
deed extends to yield equality between phase transitions in convex geometry and minimax risk in 
statistical decision theory. 

Corollary A.l. Let (5(e|PosOrthant) denote the position of the N -large phase transition for unique- 
ness of solution to y = Ax subject to x >0. Let AI{e\r]^) denote the minimax risk of the positive- 
part estimator r]^{y) = (y)+ over the class J^i,e,+ of sparse positive means. Then 

(5(e|PosOrthant) = ^(1 + e) = M(e|r/+), < e < 1. 

Further |DT10bj . derived a formal equivalence between the phase transition for the positive 
orthant problem just described, and the phase transition for the hypercube problem where one 
observes y = Ax, x G [0, 1]^ and the fraction of coordinates Xi in x such that Xj G (0, 1) is at most 
e. Hence 

5(e| Hypercube) = -{1 + e), < e < 1. 

Now define J~i,e,n = {i^ ■ supp(z^) € [0, 1], i^({0, 1}) > 1 — e}, and define the scale-invariant minimax 
risk 

M{£\7j'"'P)= sup sup ^E{{7]''''P{X + aZ)-Xf}. 

Notice that the supremum over a could have been introduced for the previous examples - soft 
thresholding and positive soft thresholding - as well. However, for scale invariant families of 
distributions, the two definitions give the same minimax risk. 
We then have 

M(£|??™P) = M(e|r/+) = -(1 + e), < e < 1. 



Then general formula (1.6) is again rigorously valid. 
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Corollary A. 2. Let (5(e|Hypercube) denote the position of the N -large phase transition for unique- 
ness of solution to y = Ax subject to < x < 1. Let M(e|ry'^'^*') denote the scale-invariant minimax 
risk of the capping nonlinearity r]^"'P(y) = max(0, min(l, x)) over the class J'i,e,n of distributions 
supported on [0,1]. Then 

(5(e|Hypercube) = ^(1 + e) = M(e|r/'=''P), < e < 1. 

B Calculation of Minimax MSE 

In this Appendix we describe the calculation of the global minimax risk over the class J-i^e- 

Throughout this section we let MSE(z/, t]) = E{(r/(A + Z) - A)^} where A ~ and Z ~ N(0, 1) 
are independent random variables. Using arguments of Bickel [Bic81| . and Casella and Strawderman 
|CS81j we have quite generally that, if v is allowed to range over a weakly compact and convex 
class J-" of probability distributions, then 

inf sup MSE{iy, r?) = 1 - inf 1(7 * z/), 

where 7 is the Gaussian measure ^•ku denotes the convolution 

of measures, and / denotes the Fisher information. For a probability measure z/j(dx) = /(x)dx, 
with density / with respect to the Lebesgue measure this is defined as 

I{^f) = p-^ix)dx. 

The distribution i'* minimizing the Fisher information (if it exists) is called the least-favorable 
distribution. Using arguments of Bickel and Collins |BC83j , we know that this distribution has the 
form of a mixture of point masses 

00 

i=— 00 

where Yli'^i = 1, Oi > and the sequence {/iijiez has no accumulation points. By [BicSl] the 
minimax nonlinearity is then of the form 

v*{y) = y- i^*{y), 

where ijj* is the so-called score function 

r{y) = ^iogf*{y), 

dy 

and /* is the density of z^* ★ 7. 

Focusing now specifically on the class J- = without loss of generality we can assume that 
the m are monotone increasing and that /io = 0, with ao = (1 — e). A conjecture of Mallows |Mal78j 
states that in fact we may take 

IJi = c-i = (1 - e) • eV2, ieZ\{0}. 
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In other words, the conjectured least favorable distribution has the form of a two-sided geometric 
distribution on a scaled copy of Z. While this conjecture is thought to be false |BC83j . we proceed 
as though it were asymptotically correct, i.e. a correct model for the behavior of /ij and Oi for large 
\i\. 

For estimating the minimax risk numerically, we chose a parameter K = 50 and assumed 
a generalized "Mallows form" for \i\ > K. More precisely, we assume an equispaced grid, and 
geometrically decaying weights. This is a little more general than what Mallows proposed, having 
3 total degrees of freedom (spacing, total weight and rate of decay), rather than one. For —K < 
i < K, we allow the parameters Qj and Hi to vary freely. In this way we obtained a parametric 
family ug of probability distributions, with parameter 6 = (e, (oj), (;Uj), cq, ci), with 

K 

k=\ k>K 

Define 

/+ = min{/(7 ue) : 6 e R^+^}. 

The quantity can be estimated numerically, and provides an upper bound on /*. In order 
to get a lower bound, we use Huber's minimax theorem |Hub64[ IHR09j . which tells us that 

— — = mt sup „ , (B.lj 

where in the supremum g ranges over all densities ^-kv with u G Ti^e and ranges over all functions 
differentiable in measure. Therefore, for any specific ij) let denote a least-favorable density. We 
then have 



I{l*iy*)>J{^,g) 



Let I/"*" denote the optimum of the parametric optimization and let denote the corresponding 
score function ^ 

i^^iv) = ^log/+(y), 

where = u'^ •k'j. Let g~^ denote a maximizing density g for J(ip~^,g); by Huber's theory, this 
can be chosen to be two-point mixture (1 — £)j{ ■ ) + £</>( • — fJ-~^) where yu"*" is chosen to achieve the 
worst case value on the right side of (B.l) and set /_ = J{'ip^ , g^). 

We have the bound /_ < /* < Such lower and upper bounds are, however, concerning 
exact integrals with exact optimization. Numerically, we compute integrals and extrema over fine 
grids with at least 100 samples per unit of range, getting not and /_ but instead numerical 
approximations I"*" and I_. 

Table [3] presents some information about numerical approximation results, which may help the 
reader assess its accuracy. 

Some minimizing distributions obtained in this way are shown in Figure 17 the mass points 



(/Uj) are displayed in Figure 18 



Our numerical results, showing that /_ ~ /+ allows us to infer that the "Mallows form" is 
approximately correct. The "minimax nonlinearity" we actually apply in our estimation and com- 
pressed sensing experiments is: 

v^{y) = y- 4^^{y) ■ 
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£ 


K 


1-/+ 


MSE*(^+,e) 


0.01 


2 


0.0533 


0.0533 


0.05 


2 


0.1841 


0.1841 


0.10 


2 


0.3026 


0.3026 


0.15 


2 


0.3983 


0.3984 


0.20 


4 


0.4802 


0.4803 


0.25 


4 


0.5516 


0.5516 



Table 3: Numerical Values of Lower Bound on Minimax MSE (\ — /^(e) ) and of Upper Bound on 
Minimax MSE. The upper bound is the numerically computed worst-case MSE of the corresponding 
decision rule 77"^. In these cases the bounds agree except possibly in the fourth decimal place, where 
the upper bound might be larger. 
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Figure 17: Least-Favorable distributions over the class Ti^e obtained by numerically minimizing 
Fisher information. 
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Figure 18: Locations of masspoints in approximate least-favorable distribution (//j) versus index i, 
at various e, Note the approximate linearity; least-squares linear fit has slope parameter given in 
legend. The 'wiggles' away from the linear behavior occur near the origin. 



C Calculation of Block Soft Thresholding Minimax MSE 

The argument for the 5-dimensional case is parallel to the scalar case (corresponding to = 1) 
treated in Appendix [a} Throughout this Appendix, we let r]{y;T) = ri^°f^{y;T) denote the block 
soft block thresholding denoiser • ; r) : — )• R^. We then have 

M{e\ri) = inf sup ^ E{ + Z; r) - } 

= inf sup ]--E{\\7]{X + Z-t) - Xf} 

= iiif sup (l-e)-^E{||r?(0 + Z;T)-Of}+e.lE{||r/(/x + Z;r)-/if}, 

where X ^ v and Z ~ N(0, 1) are independent, and the reduction to two-point distributions follows 
from the representation of J-b,£ as a mixture of such two point distributions, along with linearity 
of expectation. Defining 

Meifi, r; e) = {I - e) ■ -^E{ ||r?(0 + Z; r) - Of} + e • -^E{ + Z; r) - /if} , 

we note that t i— >• MSEsit ■ fj.,T;e) is monotone increasing in t > and in fact depends only on 
\\lJ,\\2- Hence 

sup MB{fJ.,T;e) = lim MB{fi,T; e) = Mb{oo,t; e) . 
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It is easy to compute 

Ms(oo,r;l) = 1 + ^• 

We also have 

1 1 1 l-co 

Mb{0,t;0) = -E{MZ;T)f} = -E{{./X^ - t)1} = -j^ {x^'^ - rf ^x) dx , 

where Xb denotes a chi-squared variable on B degrees of freedom, and /o,_b is the univariate 
density of a chi-squared variable on B degrees of freedom. The explicit form of the latter is 
fo,B{x) = x^/^~^e~^/^2~^/^r(i3/2)^^. Define the (complementary) incomplete a-th moments of 
the chi-squared Io^B{T,ct) = J!^ fo^B{x)dx; these are expressible by incomplete moments of the 
Gamma function. Then Mb{0,t;0) = {/o,B(r,l) - 2tIo,b{t,1/'2) + t'^Io,b{t,0)}/B. 
We seek to evaluate 

MB(e|BlockSoft) = inf MB(oo,T;e), 
TeR+ 

and note that 

MB(oo,T;e) = (1 - e)MB(0, r; 0) + eMij(oo, r; 1). 

In order to determine the minimum of r i— )• MS'E's (oo, r; e), we compute its first derivative and 
obtain 

d 

B—Mb{oo, t; e) = {!-£)■ {-2Io,b{t, 1/2) + 2r/o,ij(T, 0)) + e • 2r. 
We conclude that the derivative vanishes where 

_ /o,B(r,l/2) . . e 

J^B{T) = h,B[T, Uj — 



1-e 



The function Lb is monotone decreasing on (0,oo], and so there is a unique solution T^(e). The 
minimax mean square error is given by 

MB{e\7]) = MB{<x,T*B{e);e). 

C.l Asymptotics for Large Blocksize 

In this appendix we prove the limiting result in Lemma 3.1 

The scaling of the problem with B makes it natural to consider thresholds of the form Tc^b ~ 



c • Vis . We claim that the per-coordinate MSE with such a choice is 

lim MB(oo,Tc,B;e) = (1-e) - lim Mb(0, Tc,_b) + e lim M(oo,Tc,b) 

= (l-e)-(l-c)^+e-(l + c2). (C.l) 

Calculus shows that (1 — e)(c — 1)^ + e • (1 + c^) is minimized at c* = 1 — e. We conclude that 

lim MB(e|BlockSoft) = (1 - e)e^ + e • (1 + (1 - ef) , 

which coincides with the statement of Lemma 13. 1[ 
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In order to complete the proof, we sketch the derivation of the claim |C.1[ Of the two limits, 
the easy one is limB^oo Mb{oo, Tc^b) = 1 + c^- This is immediate because Mb{oo, t) = 1 + t"^ /B 
in every dimension. 

The limit lim^-i^oo Mb{0, Tc^b) = (c— 1)^ follows instead from the central limit theorem. Indeed, 
let Xq^b denote a central chi-squared with B degrees of freedom. Its square root is the norm of 
a standard Gaussian random vector in dimension B, and is effectively about Indeed by the 

central limit theorem \/2{y^XQB — VB) =^d N(0, 1) as — )• oc. So as i? — >• oc, 



Urn Mb{0,Tc,b) = lim ^E{{./X^-t,^b)1} 

B—>-oo B—^oo t) 

= lim E|(l -c + 

and the latter converges to {1 — c)\ by dominated convergence. □ 



D Non-Convergence of State Evolution 

In this appendix we show non-convergence of state evolution for firm-thresholding AMP and globally 
minimax AMP below Sssi^)- More precisely, we show that, for 5 < 6s Ei^) there exists a probability 
distribution v such that the state evolution sequence {'mt}t>o, rnQ = ¥,,^{X'^} does not converge to 
0. 

We begin by developing a lower bound that holds for all the shrinkers r] studied in this paper. 
Let Ve^^ denote the mixture (1 — £)6q + ew^ where uj^ is either (if B = 1), the Dirac mass at fi, or 
(if S > 1) the uniform measure on the unit ball of radius /j in R^. In both cases, let r*(e) denote 
the minimax tuning for r/. Here /i G R+ U {oo} can be either finite or infinite. 

Definition D.l. Let R{fi) = R{ijl,t) = MSE5(r/( • ); z^e^^j) denote the risk function of denoiser 
r]{-]T) with tuning r for the signal with distribution Ve,fi- We say that R is superquadratic on 
[0, IJ.q) if, for any ii G [0, /io), ^ 

i?(/x)> (-\ -Rifio)- (D.l) 

The next result shows that superquadratic behavior of the risk function implies non convergence 
of state evolution for signal distribution l'e,^l■ 

Lemma D.l. (State Evolution non- Convergence) Fix any t £ Q, and assume there exists 
HQ > such that: (i) The risk function R{n) = R{fi;T) is superquadratic on [0, /^o); (H) Ri^J-o) > 6, 
and (Hi) S > e. 

Let ^'(m) = ^'(m; 6, r, z^^^^, B). Then there is m^p > such that 

"T- > fnfp =^ ^'(m) > mfp . 
In particular, if m^p < mo = -B~^Ejy^ ^ = e/x^/S then, for all t>0, 

mt>mip>0. (D.2) 
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Proof. By the scaling relation (4.1) 



Define mfp = (/u//xo)^ and assume m > mip. This implies yil^/rn < jj,Q, and, since R is su- 
perquadratic by assumption, 

^(m) > y i?(/io) =(-)' = "^fp , 

V^/m u*/ Vun/ 



which concludes the proof. □ 

We also note that R need not itself be superquadratic, but merely dominate a superquadratic 
function, to get a similar conclusion. 

Corollary D.l. Suppose that, in the previous lemma, we merely assume that R{fi) is pointwise 
greater than a function r{fi) which is superquadratic on [0,/xo] and where r{fiQ) = 6. The exact 
same conclusion follows. 

We resort to explicit verification of superquadraticity. 
Numerical Observation. For rj G {77^, r/^^} , let T*{£\r]) denote the minimax-tuning for spar- 
sity e and fi*{e\rj) denote the least-favorable fi. The risk function R{fi) = R{p,T* {e\r])\r]) is su- 
perquadratic on {0, fi* {e\ri)). 

The underlying numerical study was as follows. For each shrinker, we took a fine grid of e and 



at each fixed e evaluated R{fJ,) on a fine grid of /i, checking the inequality (D.l). See Figures 19 
andEOl 

Finding. The State Evolution Phase Transition obeys SsEi^li]) — ^{^\v) ^or rj G {r/^, r/^^*^}. 

E Minimax MSE for Monotone Regression 

The continuous curve in Figure[TO]was generated by the straightforward device of generating random 
monotone objects with e = k/N fraction of change points. The "change heights" were chosen to 
be very large, and the change points were distributed uniformly at random. To understand the 
rationale for this choice, we describe an underlying theory. To the best of our knowledge the 
following evaluation of minimax MSE over e-simple sequences is novel. 



Throughout this section rj = rf^°^° denotes the monotone regression denoiser (5.2). For ^ S 
Ai C a monotone vector, MSE(r/, /x, a) denotes the corresponding mean square error in noise 
with variance a"^, i.e. 

MSE(7?, ^l, a) = E{ + aZ) - nf] , (E.l) 

where Z = (Zi, . . . , Z„) ~ H{0, Ij\fxN)- We also use the shorthand R{fJ.) = MSE(r/, mn, 1) for the 
risk at fi. We use two principles: 

• Monotonicity of the risk function. The function 1 1— )• R{t-fi) is monotone increasing in t E R+. 

• Fragmentation. The limiting risk limt_>oo Rit ■ ^) decouples into a sum of risks of smaller 
problems for reconstructing signals with no internal change points. 
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Superquadraticity of R(^) lor Minimax Semisoft Thresholding e= 0.1 
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Figure 19: Numerical Verification of Superquadraticity for Firm Shrinkage. The green curve depicts 
the quadratic function • (////x*)^. The vertical line depicts the position of fi*. The red curve 

depicts R{lJ-)- The fact that the red curve is above the green curve throughout the interval [0,//*) 



verifies the superquadratic condition (D.l). 



Superquadraticity o1 R(fi} for lylinimax Shrinl<age e= 0.10 



- Quadratic 




Figure 20: Numerical Verification of Superquadraticity for Minimax Shrinkage. The green curve 
depicts the quadratic function R{n*) • (/i//i*)^. The vertical line depicts the position of /i*. The red 
curve depicts R{f^)- The fact that the red curve is above the green curve throughout the interval 
[0,;U*) verifies the superquadratic condition (D.l). 
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The first principle is analogous to similar statements for positive soft thresholding, and the under- 
lying reason is the same. 

The second principle holds by the following argument. For a given monotone sequence jiAi, 

^gg(-^mono^ t/x, 1) = MSE(r?'"''"°, n, 1/t) ■ t^. 

This follows from the fact that the cone M. of monotone sequences is positive homogeneous, so we 
can equivalently consider scaling up the object /i or scaling down the noise a and normalizing. Take 
the second viewpoint (i.e. the right hand side in the above equation). Introduce the optimization 
variable v = {x — Since the noise vector is z = {y — n)/a, the monotone regression problem 

is equivalent to the following 

minimize \\v — 2;||| , 

subject to Avi > — — A/x^, for alH G {1, . . . , iV — 1} , 

(T 

where we recall that Avi = Vi-^-i — Vi is the discrete derivative. (Of course this problem does not 
provide an algorithm since it requires to know A/Xj, but here we are interested in it only for analysis 
purposes.) 

As (T — >■ 0, all the constraints Avi > — -A^j for which Ajii > become irrelevant. Let Iq denote 
the set of sites i G {l,...,iV — 1} where /Xj+i = mui. We are naturally led to defining the following 
localized monotone regression problem 

(Qlmono) minimize |b - ^^HL 

subject to Avi > 0, for all i E Iq ■ 

Let rl"^°^^°{z; Iq) denote the solution of (Qimono) with data z, Iq) . When z ^ N(0, 1), we have the 
following limit in probability: 

lim i (r/~(/i + az) - n) = r7'~(z; h). 

As a consequence, defining B!'°^{Iq) = E,{\\r]'''^°^° {z\ Iq)!!!}) we have 

lim MSE(r/~,/x, 1/t) • = R^^Vo), 

t— )-oo 

and therefore 

lim R{t- fi) =R^°^{Io). 

In words, the worst-case risk of monotone regression is simply the local risk. 

Now the set Iq is a disjoint union Io,i U /o,2 U • • • U Iq^k of contiguous fragments, /o,fc = 
[uk-i -|- 1, . . .nfc], where is the last member of the k-th fragment. Mirroring this fragmenta- 
tion, the problem (Qimono) separates into independent optimization problems. Define the fragment 
optimization problem 

{Qlmono,i) minimize '^{zi - Vif , , 

subject to Avi > 0, if both i + 1 G /, 
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m| versus Iragmenl length I 



log(m|) vers J s log Iragment length I 




Figure 21: Fragment MSEs for Monotone Regression. Left panel: small £ ; Right panel: large 
i. Left panel shows mi versus £, i = 1, . . . ,20; Right panel shows log(m£) versus log(^) for i in 
5, 10, 20, 50, 100, 200, 500, 1000. The mi were obtained by Monte-Carlo simulation. 



where, if the fragment / is a singleton, the constraint disappears. Then the value of the overall 
optimization problem is the sum of the values of subproblems 

K 

Val(Qimono) = Val((5;mono,/o,fe)) 
fc=l 

and the solution is the concatenation of subproblem solutions 
Clearly 

K 

R'''Vo)=J2^{\\sol{Qlmono,Io,J\\l}- 



k=l 

|2 



On the other hand, IE||sol(Q;mono,/o t)!!! clearly depends only on the cardinalilty of Iq^^- Hence, 
abuse notation by writing Qimono/ for an optimization fragment of cardinality ^, and define 

m£ = ■ E{||sol(Qimono,£)|li} • 



Note in particular that mi = 1 and m^ is monotone decreasing in £. Figure 21 , left panel, shows 
that for £ ranging from 1 to 20, the MSE is decaying rapidly. Figure 21 right panel, shows that 
for £ ranging from 5 to 1000, we have the approximate power law 

log(m^) ^ 0.61 -0.78 log(^), 

where the rational exponents 3/4 and 4/5 seem plausible for the underlying asymptotic relation. 

Let 7r£ denote the fraction of fragments having a given length, i.e. vr^ = K~^^{k : |/o,fc| = 
For the per-coordinate MSE we have 
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In short, the worst-case MSE depends only on the fragmentation characteristics and the behavior 
of the mean-squared solution of some standard problems. 

In the simulations underlying Figure [ToJ the change points were placed uniformly at random, so 
the fragment lengths are approximately iid exponential random variables with mean N/ {k + 1) k, 
1/e. Hence we used 

M(e|MonoReg) = ^vrfm^^, 
where the weights 7r| follow the geometric distribution: 

The preceding theoretical discussion makes clear why we chose to have very large change sizes 
at the change points (in practice change sizes larger than 3 are sufficient). It does not makes clear 
why we considered change points distributed at random. 

In fact we have only been considering M(e|MonoReg), the minimax risk over objects with 
randomly- distributed change points. We could have considered instead M*(e|MonoReg), the mini- 
max risk over objects with least-favorable change points. 

Lemma E.l. The least-favorable fragmentation is one which achieves the following supremum: 

M*(e|MonoReg) = sup{^ Tr^m^^ : ^^vr^ = 1/e}. 

e e 

Equivalently, the curve (1/e, M*(e|MonoReg)) is the least concave envelope of 

ii,mei), i=l,2,... 



Figure 22 shows that the concave envelope is apparently the interpolating linear spline. It follows 
that the least-favorable case has all fragments of equal length k/N. The geometric distribution is 
therefore (slightly) more favorable than the least-favorable change-point distribution. 

F Minimax MSE for Total Variation Minimization 



The curve M{e\TV) in Figure 11 was generated in a fashion paralleling the monotone regression 
case. We generated random piecewise constant objects with e = k/N fraction of change points. 
The "change heights" were chosen to be very large and randomly signed, and the change points 
were distributed uniformly at random. 

The theoretical motivation is similar to the case of monotone regression, with additional features: 

• the addition of a tuning parameter r, 

• the addition of boundary conditions into the fragment optimization problems. 

Skipping all the preliminaries, for i > 1 we define four fragment optimization problems, corre- 
sponding to different boundary conditions: no boundary conditions: 

(QuvAOfi) minimize { 2 X]^^^ ~ ^''>^ + X] ~ "^^l } • 

i=l i=l 
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Figure 22: Fragment MSEs for Monotone Regression against reciprocal variable e = Xji. Plot 
combines the asymptotic law mn = exp(0.61)^~^ ''^ for £ > 20 with direct simulation for £ < 20. 
Note the apparent concavity. 



Even-parity boundary conditions: 

{Qttv,e,+,+) minimize |^ ^^^(2:^ - Vjf + Xy^J\vi+i - Vi\ + X{vi + ve)^ 
Odd parity boundary conditions: 

{Quv/,+ -) minimize |^ y^(^i - Vjf + X^^Jvi - Vi-i \ + Xjvi - 

i=l i=2 

Boundary conditions at one end: 

{Qhv,ifi,+) minimize {3 ~ + - + Xvi> | 

i=l 1=2 

If ^ = 1, we have simply 



£-1 



1=1 i=l 



while 



(<3zii>,£,+,+) minimize { ^(^^i - ^^i)^ + 2Aui| , 
(<3zto/,o,+) minimize \^[zx - v^f + Xv^ . 



iQitvA+,-) minimize ^{zi - vif , 



and 

(QitvAOfi) minimize ^(2:1-^1)^ 
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Define = • E{||sol(Q^,^_^ for s G {00,0+,++,+-}. 

We let F = (I, s) denote a fragment, where I C {1, . . . , A^} and s G {00, 0+, ++, H — }. 

For each sequence //, the hst of fragments {{Ik, Sk)} is simply the list of pieces (subintervals of 
constancy) Ik, decorated as follows. First, suppose that piece I^ does not meet the endpoints 1 and 
N. If /i is a local extremum at this piece, we pick Sk = ++. If is not a local extremum, we pick 
Sk = -\ — . If /fc nieets the endpoints 1, A'', we have Sk = 0+ or 00 depending on whether Ik meets 
both endpoints or not. With these conventions, let J = {Fi, . . . ,Fx) denote a list of fragments; 
the local MSE conditional on the fragmentation is: 

i?''"=(J,A) = J]7r,,,m,y. 

We then have the random- changepoint minimax MSE 

M(e|rF) = mm^^,%m,V 

where 7r| ^ is the set of probabilities of fragment types induced by random changepoint assignment 
with a fraction of changepoints e = k/N . We also have the least-favorable changepoint minimaxMSE 

M*{e\TV) = mm sup I ^ Tr^^smlj : = ^ ^ tt^,, 
[ e,s e,s 

where the supremum is over all probabilities on the collection of decorated fragments, having mean 
length 1/e. 



G Computational Details 

Because this paper makes a heavy reliance on numerical evidence, we record here a number of 
details which may help the reader of the paper. 

• Environment. All computations were done in Matlab running under Mac OS X. In the spirit of 
reproducible research, we intend eventually to make the code available for interested readers. 

• Numerical Computation of Minimax Risks. For soft, hard, and firm thresholding, there 
are closed- form expressions for the MSE at a given mixture of point masses, in terms of 
fa y^4>{y) dy, where 4> denotes the standard normal density. Compare Appendix Xj For min- 
imax thresholding, one must integrate on a fine grid; we used standard numerical integration 
routines in Matlab. 

For block thresholding, there are closed-form expressions for the MSE at a given mixture of 
spherical masses in terms of incomplete beta functions. Compare Appendix [C| 

For the worst-case MSE over a class of probability measures, one must in principle maximize 
the MSE over all admissible measures. For the scalar- and block- separable smoothers con- 
sidered this means maximization over two-point mixtures, (1 — e)(5o + 5^, which ultimately 
requires maximization of MSE(?7,/i) across all ^. In some cases (Soft, BlockSoft, JamesStein), 
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the extremal ^ is "at infinity" ( a formula can be given in each case for the MSE "at infinity"); 
in other cases the least-favorable fi is finite and must be approximated by maximization over 
a fine grid. Generally speaking we performed grid searches by iterative subdivision: replace 
the interval in question into a finite grid of points (eg. 100), then evaluate by brute force 
on this grid and identify a subinterval with grid boundary points which contains the discrete 
optimum. Then replace the original search interval by the newly-found subinterval. 

For the nonseparable cases, the process we followed for numerically approximating minimax 
MSE was explained in Appendices |E] and [F| 

Monte Carlo Simulations. For scalar separable thresholding (soft, hard, and firm thresholding, 
and minimax shrinkage), we ran Monte Carlo simulations with the following parameters: 1000 
Monte Carlo repetitions at each parameter combination; system sizes A'^ = 1000, 2000, and 
4000. We varied the sparsity parameter e G {.01, .05, .10, .15, .20.25} and, at each fixed 
sparsity level we varied the undersampling parameter 6 around the predicted phase transition 
level, according to M(e|?7) {-.015, -.010, -.005, 0.00, .005, .010, .015} with M{e\T]) being 



the minimax MSE; thus we are choosing parameters in the region where the formula (1.6) 
predicts a transition. 

We defined success according to whether or not 



|™0 „300||2 

' "2 < 0.01. 



\x 



0||2 
l2 



• Monotone sequences. For monotone regression, we used the pool-adjacent-violators algo- 
rithm. For determining the AMP-mono phase transition, we ran Monte Carlo simulations 
with the following parameters: 1000 Monte Carlo repetitions at each parameter combina- 
tion; system sizes N = 200 (only, because of relative computational slowness). We varied 
the sparsity parameter e G {-01, .05, .10, .15, .20.25} and, at each fixed sparsity level we var- 
ied the undersampling parameter S around the predicted phase transition level, according to 
M{£\mono) + {-.015, -.010, -.005, 0.00, .005, .010, .015} with M{e\mono) being the worst- 
case MSE as discussed in Appendix [E} 

We found that it was necessary to define success differently in this experiment than it had 
been defined for separable nonlinearities. We fund that the relative MSE did not exhibit 
a sharp 'first order' phase transition, but rather seemingly a second order transition, which 
seems hard to locate precisely. We found that the success measure 

iV-i#{i : \xo{i) - a;30°(i)| > 0.01} < 0.01 

gave relatively sharp phase transitions which could be located precisely. 

To obtain the phase transition for the convex optimization problem {Pmono) we solved {Pmono) 
using CVX. 

• TV Minimization. For TV minimization, we used the software package tvdip. We considered 
many other software packages but found that they were not adaptable to the 1 — d case or 
else not reliable. We ran Monte Carlo simulations with the following parameters: 50 Monte 
Carlo repetitions at each parameter combination; system sizes N = 200 (only, because of rel- 
ative computational slowness). We varied the sparsity parameter e G {.01, .05, .10, .15, .20.25} 
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7 




1/3 


0.99521 


1/2 


0.98724 


2/3 


0.9666811 


3/4 


0.950506 


1 


0.893454 



Table 4: Powers 7 and resulting for fitting Model (H.l) 



and, at each fixed sparsity level we varied the undersampling parameter 6 around the predicted 
phase transition level, according to M{e\TV)+{-m5, -.010, -.005, 0.00, .005, .010, .015, .020, .025} 
with M{e\TV) being the worst-case MSE as discussed in Appendix [E} 

As in the monotone regression case, it was necessary to define success differently here than 
for separable nonlinearities. We again used the success measure 

N~^#{i : \xo{i) - x^°^{i)\ > 0.01} < 0.01. 
To obtain the phase transition for the convex optimization problem (Ptv) we used CVX. 



H Finite- Scaling 

The empirical phase transitions observed in the scalar separable case admit further analysis, to 
study whether (a) the offsets tend towards zero with increasing N (as we predict) and whether (b) 
the steepness of the phase transition increases with increasing N. 



H.l Offsets Decay Toward Zero 

As described in the text, at each fixed value e of the sparsity parameter, we gathered data at 
several different values of 6, and obtained the empirical phase transition parameter PT(A^, e, rj) 
(say), recorded in units of offset from prediction, so that FT{N, e,rj) = means that the 50% 
success location fitted to the e-fixed, (5- varying dataset is exactly at the predicted location M(e|?7). 
Our analysis game not only the empirical phase transition location, but also its formal standard 
error SE(PT). 

We fit a linear model to the dataset, considering a range of powers N'"' that might be describing 
the decay of the offset with increasing A^: 

PT{N,e,rj) = c(e, ?7)A^-t + 5S(Fr) • N(0, 1), (H.l) 

Table shows that 7 = 1/3 provides an adequate descripti on o f the offsets, with an exceeding 



0.995. A plot of raw PT's versus the predictions of model (H.l) is given in Figure 23 



H.2 Transitions Sharpen 

In addition to an empirical phase transition parameter PT(N,e,ri) we also fitted an empirical 
steepness parameter /3(A^, e), according to the logistic model: 

logit{SuccessProb{N, e, 6, r/)) = P{N, e, t]){6 - PT{N, e)) 
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Empirical Pliase Transition Offset vs Predicted Offset 



FFp f^sS 
S 



s 

s s 



0.008 0.010 
Predicted Offset 



Figure 23: Fitted Offsets versus Predicted Offsets in model (H.l) 
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1/3 


0.99286 


1/2 


0.99927 


2/3 


0.991064 


3/4 


0.982350 


1 


0.947641 



Table 5: Powers 7 and resulting R for fitting Model (H.2) 



Here we expect /? to grow with increasing N, corresponding to increasingly abrupt transitions from 
complete failure at 5 <^ PT{N, e)) to complete success at 6 ^ PT{N, e)). We fit a linear model to 
the /3 data considering a range of powers that might be describing the growth of the steepness 
with increasing N: 

^{N, 6,7]) = c{e,r])N'' + Error, (H.2) 

(We did not do a weighted least squares fit here. Although we should have done so, it will be 
evident that the results probably wouldn't much change had we done so). 

Table [5] shows that 7 = 1/2 provides an adequate description of the steepnesses, with an 
exceeding 0.999. A plot of raw /3's versus the predictions of model (H.2) is given in Figure 24 



I Risk Function Properties 

In this appendix we prove several useful properties of of the block soft and James-Stein denoisers. 
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Actual Steepness vs Model Predictions 
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m 
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Predicted Steepness 



Figure 24: Fitted steepness versus Predicted steepness in model (H.2) 



I.l Block Soft Thresholding 

Lemma I.l. For block soft thresholding the risk function /i i— )■ R{p,) has these properties: 

/i — )• R{fi) is monotone increasing, (I.l) 

< 1> (1-2) 

R{fi) <mm{R{0) + fi,B + t'^}, (1.3) 

R{fi)^B + t^ fi^oo. (1.4) 

Proof. It will be convenient within the proof to set d = B and ^ = /x^. Let S*^ = ||1^||2) so that S"^ 
is distributed non-central xl(0) ^'^d so ES"^ = ^ + d. As the block soft thresholding rule is weakly 
differentiable, Stein's unbiased estimate of risk yields R{pL] r) = 'EyU[S), with 



U{S) 



S'^-d S <T, 

d + T^ -2{d-l)TS-^ S>T. 



To derive (I.l) and (1.2), we use the fact that S ~ XdiO ^ non-central chi-square distribu- 



tion and the corresponding density function f^^d{w) satisfies 



dw 



f^,d+2{w) = f[f^,d+2{w) - f^,d{w)]. 



Applying the first identity, integrating by parts, canceling terms, and then using the second identity, 
we obtain 

— i?(e, r) = f^,d{w) dw + {d- l)r J ^ w-^'^ f^^d+2{w) dw > 0. (1.5) 
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[When d = 1, this reduces to the derivative formula for scalar thresholding.] For the upper bound 



(1.2), use the Poisson mixture representation f^^d+2iw) = Yl'o^ P£./2fd+2+2j{'w) and an identity for 
the central density family to obtain 

fd+2+2j{w) _ fd+2j{w) ^ fd+2j{w) 



W 



d + 2j 



and so to conclude that f^^d+2{w) < {w/d)f^^d{w)- 

Hence the second term in (1.5) is bounded by (1 — 1/r) dw, whence follows the 

conclusion (dr/d^) < 1. Property (1.3) is immediate from (I.l) and (1.2) and the large-^ limit of r. 



The risk function at zero can be evaluated exactly via 



iVnj - rffdiw) dw = Y, CkFd+k{T 



for Co = r2,ci = -2^/2T{{d + l)/2)/r(d/2) and C2 = d. To obtain the bound in (|L4|, write the 
risk function using the unbiased risk formula as 

R{^^L) =i + E[2d + - 52 - 2{d - l)r5-\ S > t] 

and observe that the integrand is decreasing in S, and bounded above by 2 for S > r, so that 

□ 



i?(0) < 2P(x^ > r2). This yields (1.4) 



1.2 Positive-Part James-Stein Shrinkage 

Let ei denote the vector (1,0,..., 0) in R'^ and Y ~ N(^ • ei,Idxd) and put S 
the noncentral chi-squared distribution S ~ XdiO with noncentrality = iJ?. 
Let r]'^^{x) = (1 — (d — 2)/S)j^ ■ Y. Stein's unbiased estimate of risk is 



|y||2. We have 



Hence, 



[7(5) 



S-d S <d-2, 

d-{d-2f/S S>d-2. 

R{fi) = EU{S). 



1.2.1 Risk at 

For central chi-squared on d degrees of freedom, the density satisfies wfd{w) = dfd+2{w). With 
Fd{x) = P{xl>x), 



R{0) = e| 



1 



d-2, 
S ' 



Y 



{w - 2{d-2) - ^^^\jdiw)^w 

d-2 W ' 

dFd+2{d - 2) - 2(d - 2)Fdid -2) + {d- 2)Fd-2{d - 2). 



58 



With D = d — 2, we can rewrite the second hne as 

R{0) = D-'E{xl - D)l. 
We have the convergence in distribution, 

adding a tightness argument (not given here) one can show that 

R{0;d) ^2EZl = l, d ^ oo, 
where Z ~ N(0, 1). Numericahy R{0; d) ^ 1 + 0.752/ Vd for large d. 

1.2.2 Monotonicity of 

Here and in the next subsection we use the variation-diminishing version of total positivity: 
Theorem I.l. The non- central family is strictly variation diminishing of all orders 
See [B.TMSlj . 

Let S~{g) and S^{g) denote the number of sign changes and strict sign changes of function g 
on [0, oo), and let IS^{g) and IS^{g) denote the initial sign (at 0). Given a function g : R+ i— R^., 
define function 7 : R_|_ 1— R+ by 

7(0 = j giw)fd,^iw)dw . 

The SVR property says that 5'"'' (7) < S^{g) and that if S'^{'j) = S^{g) then necessarily IS^{'~f) = 
IS-{g). 

Now verify that the risk R{fj,) of rj'^^ is monotone increasing in /i > 0. Put G{^) = K^(,iU{S). 
Since s — )• ?7(s) is strictly increasing, it follows from SVR2 that so also is G{6.)- Indeed, consider 
g = U — a for all constants a, and infer that the number of sign changes in the corresponding 7 is 
not larger than the number in g. But j = G — a, and so G is monotone. Finally: G(0 = R{n) 
where ^ = //^. 

1.2.3 Superquadraticity of R{fi). 

Lemma 1.2. For each a > 0, the function fj? — )• R{fi) — afj? has one strict sign change. 

This implies superquadraticity: given fiQ, set a = i?(/io)/^o- Obviously -R(/x) — a//^ changes 
sign at but the lemma says this is the only change. At the same time, we know R(0) > by a 
previous subsection. Hence, R{fi) — a^^ > for < /^o, and so 

R{iA > (—) Rii^o), o<fi<fio. 
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Proof. Since E[5 — d] = ^ = ^u^, we have 



with 



^ \d-{d-2)^S-^ -a{S-d) S>d-2. 



(1.6) 



By inspection of the function g in (1.6), we have 

'2 a<l, IS-{g) = - 



S-{9) = 



1 a>l, /S'-(5) = +. 



Note that 7(0) = i?(0) > 0, so /S'+(7) = +. Consequently S^{'^) < 1 in both cases. Then from 



behavior at +00 it foUows that S'~*'(7) = 1. 
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